Sensors
  • Article
  • Open Access

19 March 2021

DRER: Deep Learning–Based Driver’s Real Emotion Recognizer

1 Graduate School of Automotive Engineering, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul 02707, Korea
2 Department of Automobile and IT Convergence, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul 02707, Korea
3 Chassis System Control Research Lab, Hyundai Motor Group, Hwaseong 18280, Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Deep Learning Methods for Human Activity Recognition and Emotion Detection

Abstract

In intelligent vehicles, it is essential to monitor the driver’s condition; however, recognizing the driver’s emotional state is one of the most challenging and important tasks. Most previous studies focused on facial expression recognition to monitor the driver’s emotional state. However, while driving, many factors prevent drivers from revealing their emotions on their faces. To address this problem, we propose the deep learning-based driver’s real emotion recognizer (DRER), a deep learning-based algorithm that recognizes drivers’ real emotions, which cannot be completely identified from their facial expressions. The proposed algorithm comprises two models: (i) a facial expression recognition model, which is built on state-of-the-art convolutional neural network structures; and (ii) a sensor fusion emotion recognition model, which fuses the recognized facial expression state with electrodermal activity, a bio-physiological signal representing the electrical characteristics of the skin, to recognize the driver’s real emotional state. Hence, we categorized the driver’s emotions and conducted human-in-the-loop experiments to acquire the data. Experimental results show that the proposed fusion approach achieves a 114% increase in accuracy compared to using only facial expressions and a 146% increase in accuracy compared to using only electrodermal activity. In conclusion, our proposed method achieves 86.8% accuracy in recognizing the driver’s induced emotions while driving.

1. Introduction

Drivers’ emotional state affects their ability to drive [1,2]. As vehicles become more intelligent, it becomes increasingly important to recognize the driver’s emotions. Accurately detecting the driver’s emotional state allows the vehicle to respond more quickly to the driver’s emotional needs; it can provide adequate infotainment support and adjust vehicle dynamics for a safer and more comfortable ride. In intelligent vehicles, recognizing the driver’s emotion is emphasized because the vehicle can select options according to the driver’s emotional state (e.g., the driving mode, a song to change the atmosphere, or autonomous driving).
In the human–machine interface, facial expressions are considered important because they are useful for revealing emotions between people. Methods based on facial expressions have been established as a research field called facial expression recognition (FER). With the rapid development of deep learning-based image recognition, deep learning is increasingly used for FER [3,4,5,6,7,8]. However, facial expressions cannot always reveal humans’ real emotions due to various factors. This characteristic is even more pronounced in the case of drivers. For instance, when a driver frowns while driving, it may be tempting to assume that the driver is currently in an unpleasant state if the judgment is made purely on the basis of the driver’s facial expression. However, if it is simply the reaction of the driver’s facial muscles to the stimulus of sunlight, then the driver should not be judged to be in an unpleasant state. Therefore, the emotions that appear in a driver’s facial expressions are not always their real emotions. Consequently, we aim to recognize the driver’s real emotion even in situations in which the real emotion is not fully revealed via facial expressions while driving. A related research field is micro facial expressions, in which micro changes in expressions within a very short duration are studied. Such micro changes normally occur when real emotions are concealed deliberately or unconsciously. Some research focusing on facial micro expressions has produced promising methods for detecting concealed emotions [9,10,11,12]. A micro expression can be a clue to the driver’s real emotions, but the lack of samples per category and the imbalanced distribution of samples are the primary obstacles to its use in deep learning-based algorithms [13]. Ultimately, the driver’s real emotion that we target is not a concealed emotion but an emotion that is not fully revealed. Furthermore, much research uses bio-physiological signals for recognizing human emotions [14,15,16,17,18,19,20,21,22,23]. The most commonly used bio-physiological signals are the electroencephalogram (EEG), electrocardiogram (ECG), photoplethysmography (PPG) and electrodermal activity (EDA). Moreover, some studies using both facial expressions and bio-physiological signals achieved high accuracy in emotion recognition [24,25] and performed well in recognizing various emotion classes [26,27]. All these studies are based on deep learning algorithms. Based on the above trends, we propose the deep learning-based driver’s real emotion recognizer (DRER) to recognize the driver’s real emotional state while driving based on the sensor fusion of the driver’s FER and bio-physiological data. The proposed method is divided into two steps.
The first step is FER—recognizing the driver’s facial expressions while driving. We propose an FER model constructed with reference to several state-of-the-art convolutional neural networks (CNNs), such as VGGNet [28], ResNet [29], ResNeXt [30] and SENet [31]. The proposed FER model is an end-to-end architecture; thus, the model receives a whole image of the driver’s face and outputs the recognized facial expression state. The FER model recognizes the driver’s facial expression state using the continuous representations valence and arousal, the most popular continuous emotional representations, proposed by Russell [32]. Among several databases [33,34,35,36,37], we trained the FER model using AffectNet [33], which has more than 1M facial images annotated with valence and arousal. The model that adds SE blocks to the ResNeXt network obtains the same level of accuracy as the baseline proposed by Mollahosseini et al. [33].
The second step is sensor fusion emotion recognition (SFER)—recognizing the driver’s real emotions by fusing the recognized facial expression state with the driver’s bio-physiological signals. On the basis of a deep neural network (DNN), we propose the SFER model, which receives the driver’s recognized facial expression state (represented by valence and arousal) and the driver’s EDA signals and outputs the driver’s recognized real emotional state. The recognized real emotional state is represented as one of several discrete categories. To avoid confusion among the classified emotions, we categorize the emotions according to the driver’s real emotions while driving: neutral, happy, excited, fearful, angry, depressed, bored and relieved. To train and evaluate the SFER model, we need a dataset that contains the driver’s real emotions (represented by the aforementioned emotion categories). Hence, we conducted a human-in-the-loop simulation to obtain the dataset. Thirteen volunteers participated; each drove a full-scale driving simulator while each emotion was induced. We measured the driver’s facial images and EDA during the simulation, applied average filtering to the measured data and split them into training, validation and test sets. We trained the SFER model using the training and validation sets from the human-in-the-loop simulation.
In this study, we obtained remarkably consistent results with respect to the sensor fusion of the driver’s facial expressions and EDA data. When the driver’s emotions are recognized from EDA data alone, the accuracy is 33.1–35.8%, whereas, when they are recognized only from the FER results, the accuracy is 37.6–41.1%. When the driver’s emotions are recognized by combining the two, the accuracy is 65.8–88.0%. By refining the algorithm, the proposed DRER achieves the highest accuracy of 88.6%.
Our main contributions are as follows. First, a deep learning-based intermediate sensor fusion algorithm is proposed. There are several sensor fusion strategies: early fusion, late fusion and intermediate fusion. Early fusion, also known as data-level fusion, is a traditional fusion strategy that fuses data before they are analyzed. Late fusion, also known as decision-level fusion, fuses decisions made from each individual sensor’s data; it is simpler than early fusion when the sampling time, unit and dimensionality of the data differ. Intermediate fusion, also known as feature-level fusion, is the most flexible strategy: it fuses extracted higher-level features and allows the fusion to take place within the model training stages. In our model, the FER model extracts higher-level features from facial images, and the SFER model then fuses the extracted features with EDA to recognize emotional states. This is why our model is referred to as an intermediate sensor fusion algorithm. The proposed algorithm recognizes the driver’s real emotional state among eight newly defined emotional categories suitable for a driver by fusing only two sensors, i.e., a facial camera and an EDA sensor, whose data can be easily collected in a vehicle. Second, the experimental results show higher accuracy in recognizing the driver’s real emotion when fusing the facial camera and EDA sensor data than when using either sensor individually. The proposed algorithm was evaluated with data collected through a full-scale driving simulator that closely resembles an actual vehicle environment. As a result, a recognition accuracy of 86.8% was achieved, which is greater than those obtained using only the facial camera and only the EDA sensor by 114% and 146%, respectively. Finally, when compared with other state-of-the-art algorithms, the prediction accuracy was the highest despite having the largest number of classification classes.
The rest of this paper is organized as follows. Section 2 reviews related work on emotion recognition. Section 3 presents the proposed DRER for recognizing the driver’s real emotion from the driver’s facial expressions and EDA data while driving. Section 4 provides the details of the proposed model. Section 5 presents the extensive experiments and the human-in-the-loop simulation. Section 6 compares and analyzes the experimental results. Section 7 concludes this work and describes future work.

3. Proposed Work

We propose deep learning-based algorithms to monitor the driver’s emotional state even when emotions are not fully revealed by facial expressions while driving. Our emotion recognition system that monitors the driver’s real emotions is called DRER. We propose two main steps to recognize the driver’s real emotion: recognizing the driver’s facial expression state, and fusing bio-physiological signals with the recognized facial expression state. Figure 1 shows the proposed steps. Table 1 summarizes the terminologies used in this paper. The following subsections describe the detailed tasks of the individual steps.
Figure 1. Overview of the proposed work with two major steps: FER and SFER.
Table 1. Terminologies and definitions of the variable used in this paper.

3.1. Facial Expression Recognition

The most common way to express emotions is through facial expressions. Therefore, the first step of the proposed work is recognizing the driver’s facial expressions. While driving, the driver’s face is recorded by a camera, and the face video becomes the input of the FER model. The FER model consists of a CNN and outputs the recognized facial expression state as continuous values called valence and arousal. This representation reflects that facial expression states vary continuously and carry intensity. Moreover, these mathematical representations are easy to fuse with various measured signals. Through such fusion approaches, even emotions not revealed on the face can be captured.

3.2. Sensor Fusion Emotion Recognition

Our second step is fusion with the driver’s bio-physiological signals. As mentioned in Section 1, even if the driver’s facial expressions are detected through the FER model, they cannot always be regarded as the driver’s true emotional states. Hence, we propose the SFER model, which consists of a DNN, to recognize the real emotions beyond the driver’s facial expressions by fusing bio-physiological signals, which are related to body regulation and affected by emotions, with the output values of the FER model. The driver’s recognized emotional state is represented as one of several discrete categories. Section 4 describes the detailed fusion methodology for recognizing the driver’s real emotional state from the facial and bio-physiological information.

4. Methods

This section describes the detailed methodology of our algorithm for each model.

4.1. Facial Expression Recognition

On the basis of deep learning, we applied end-to-end architectures that receive images ( I ) including the driver’s face and output a continuous two-dimensional index, called valence ( V ^ ) and arousal ( A ^ ) . This FER model is based on the hypothesis that images ( I ) including the driver’s face are continuously provided. To meet this hypothesis, we carefully positioned the camera recording the driver’s face. The details are discussed in Section 5.2.
Several preprocessing steps (e.g., resizing, normalization, detecting the ROI and detecting points or movements) are generally required to recognize facial expressions from images [54]. However, we use only two preprocessing steps, resizing and normalization. These are relatively simple preprocessing operations because the deep learning algorithm that we use automatically finds the region of interest and extracts features. Thus, we refer to our FER model as an end-to-end architecture. Resizing is an operation that matches the input image’s width, height and depth to the input shape ( w , h , d ) of the proposed deep learning algorithm. Each image pixel needs to be scaled to the range from 0 to 1. This process is called normalization, a common preprocessing technique for deep learning. It changes the values of numeric pixels in an image to a common scale without distorting the differences in the ranges of values. The equation of normalization is as follows:
$$ x' = \frac{x - X_{min}}{X_{max} - X_{min}} \quad (1) $$
where x is the pixel value of an image, x' is the normalized pixel value, X_max is the maximum pixel value of the image and X_min is the minimum pixel value of the image. Usually, each pixel can simply be divided by 255 because most images consist of values in the range 0–255.
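As a concrete illustration, the two preprocessing steps above can be implemented in a few lines; the snippet below is a minimal sketch assuming RGB images loaded with Pillow and the 224 × 224 × 3 input shape used in Section 5.1, not the paper's own pipeline.

```python
import numpy as np
from PIL import Image

def preprocess(image_path, size=(224, 224)):
    """Resize a face image and min-max normalize its pixels to [0, 1]."""
    img = Image.open(image_path).convert("RGB")
    img = img.resize(size)                   # resizing to the model's input width/height
    x = np.asarray(img, dtype=np.float32)    # shape (h, w, 3), values in 0-255
    x_min, x_max = x.min(), x.max()
    x = (x - x_min) / (x_max - x_min)        # normalization per Equation (1); ~ /255 for full-range images
    return x
```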
To build the FER model based on deep learning network, we benchmarked the state-of-the-art CNN models from ImageNet large-scale visual recognition challenge (ILSVRC) [55]. ILSVRC is an annual object detection and image classification competition that uses subsets from the ImageNet (a large-scale hierarchical image database) [56].
The first model to be introduced is VGGNet [28], the 2014 ILSVRC runner-up. Although VGGNet is the runner-up, the gap between it and the winner is insignificant, and its structure is relatively easier to understand. Hence, we started by exploring VGGNet as the underlying network. VGGNet uses multiple CNN layers to extract the image’s features while reducing the width and height and increasing the depth of the feature maps. Figure 2a illustrates VGGNet’s structure, with the multiple CNN layers shown as blue boxes. In the figure, each time the data pass through a green box, a max-pooling layer, the width and height decrease and the depth increases.
Figure 2. (a) VGGNet with vanilla CNN (blue) and max-pooling (green). (b) ResNet with vanilla CNN (blue), max-pooling (green), and shortcut connection (orange). (c) ResNeXt with vanilla CNN (blue), max-pooling (green) and shortcut connection (orange). (d) SE-ResNet with vanilla CNN (blue), max-pooling (green), shortcut connection (orange) and SE block (yellow).
We also benchmarked ResNet [29], the 2015 ILSVRC winner, to build deep learning models that are deeper than VGGNet. ResNet can construct a deeper model without suffering from the vanishing gradient problem by applying a shortcut connection that adds the input value to the output value of every pair of CNN layers. Its overall structure is similar to that of VGGNet, as shown in Figure 2b. The shortcut connections are illustrated as black arrows, and the values before and after the CNN layers are summed at the orange box. Because the values before the CNN layers are continuously carried forward, the amount of information to be trained is reduced. Hence, ResNet can be designed to be deeper than VGGNet.
ResNet is a structure that makes the neural network deeper, and we refer to ResNeXt [30] in search of a wider model. ResNeXt [30], the 2016 ILSVRC runner-up, adds a splitting operation between shortcut connections by applying a new dimension, called cardinality, to ResNet. ResNeXt’s splitting operation is illustrated in Figure 2c. Its overall structure is similar to that of ResNet, but every second CNN layer between shortcut connections is divided along the depth direction into as many groups as the cardinality.
We applied the SE block proposed by Roy et al. [57]. The SE block, inserted in the middle of the model, recalibrates the weights between the channels of the intermediate feature maps. In Figure 3, the feature maps entering the SE block are converted into values representing each channel through global average pooling. These representative values are recalibrated as they pass through the bottleneck structure formed by the reduction ratio r. The recalibrated values are applied element-wise to the existing feature map. SENet, which applies the SE block to an existing model, is the 2017 ILSVRC winner. We applied the SE block right before the summation layers of ResNet and ResNeXt. The model with the SE block applied to ResNet is illustrated in Figure 2d, where the SE blocks are shown as yellow boxes before the orange boxes.
Figure 3. Illustration of the SE block; different colors represent each different channel.
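The following is a minimal PyTorch sketch of a channel-wise SE block as described above (global average pooling, a bottleneck with reduction ratio r, and element-wise rescaling); the layer names and exact placement inside a residual branch are illustrative, not taken from the paper’s implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise squeeze-and-excitation with reduction ratio r."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)        # global average pooling per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // r),       # bottleneck (reduction ratio r)
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                             # channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                # (b, c) channel descriptors
        w = self.excite(w).view(b, c, 1, 1)           # recalibrated channel weights
        return x * w                                  # element-wise rescaling of the feature map
```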
Every backbone network introduced above has a final fully connected layer with 1000 units to classify the 1000 object categories; in our FER model, we replaced this final layer with a two-unit fully connected layer that outputs the recognized facial expression state as two continuous values, valence and arousal. The tanh function then transforms the two output values of the fully connected layer into the recognized valence and arousal values, mapping them into the range between −1 and 1. The absolute value of the transformed valence represents the degree of pleasantness, where a positive value represents attractiveness and a negative value represents aversiveness. The magnitude of the transformed arousal value represents the perceived intensity; for instance, negative arousal values indicate lower perceived intensity than positive arousal values. The reason that the output of the FER model is represented in continuous values is discussed in Section 4.2. Algorithm 1 summarizes the detailed procedures of FER.
Algorithm 1 Our Facial Expression Recognition (FER) Algorithm.
Require: The face image of driver at time t: I t
Ensure: The recognized valence and arousal level of driver by facial expressions at time t: ( V ^ t , A ^ t )
   Initialize:
      Let the facial expression recognition model be FER_Model
      Let FER_Model be one of VGGNet, ResNet, ResNeXt, SE-ResNet, and SE-ResNeXt
      Let the final hidden layer of FER_Model be a fully connected layer with 2 units and the tanh activation function
   while Driving do
      # Resizing
      Resize $I_t$ into $\mathbb{Z}^{w \times h \times d}$
      # Normalization
      $I_t = \dfrac{I_t - \min(I_t)}{\max(I_t) - \min(I_t)}$
      $(\hat{V}_t, \hat{A}_t) = \mathrm{FER\_Model}(I_t)$
   end while
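As an illustration of the head modification in Algorithm 1, the sketch below swaps the 1000-way classifier of a standard torchvision ResNet-34 (used here as a stand-in backbone; the paper’s own backbones follow Table 2) for a two-unit fully connected layer followed by a tanh activation.

```python
import torch
import torch.nn as nn
from torchvision import models

class FERModel(nn.Module):
    """Backbone CNN with a 2-unit head producing (valence, arousal) in [-1, 1]."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet34(weights=None)              # stand-in for the paper's backbones
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)   # replace the 1000-unit classifier
        self.backbone = backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.backbone(image))               # (batch, 2): valence, arousal

# Usage: a normalized 224x224 RGB image batch yields valence/arousal estimates.
model = FERModel()
valence_arousal = model(torch.rand(1, 3, 224, 224))
```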

4.2. Sensor Fusion Emotion Recognition

To classify the driver’s real emotions ( Y ^ ) that are not fully revealed on the face, we applied a DNN-based supervised learning architecture that fused our output values of the FER model and bio-physiological signals to output the most confident emotion among several discrete categories. Although these discrete representations allowed simplification and easy understanding of the recognized emotion, confirmatory biases and priming effects depend on the individual’s experience [58,59]. Hence, it is important to define the appropriate driver’s emotion category ( C ) for simplified emotion that gives accurate intuition.
Here, we used the arousal–valence model proposed by Russell as the basic frame. By considering the nine emotions that considered the driving context [51] and the emotions extracted from the AffectNet results, we investigated the following eight emotions: neutral, happy, excited, fearful, angry, depressed, bored and relieved. Arousal and valence are both 0 for neutral and both positive for happy and excited. Arousal is positive but valence is negative for fearful and angry. Arousal and valence are both negative for depressed and bored. Valence is positive and arousal is negative for relieved. Figure 4 shows the correlations between the valence and arousal of all eight emotions. Through the aforementioned process, we set the number of emotion categories (k) to eight and defined the emotional state categories for the driver (C), as shown in Equation (2).
$$ C = \begin{bmatrix} C_1 \\ C_2 \\ C_3 \\ C_4 \\ C_5 \\ C_6 \\ C_7 \\ C_8 \end{bmatrix} = \begin{bmatrix} \text{Neutral} \\ \text{Happy} \\ \text{Excited} \\ \text{Fearful} \\ \text{Angry} \\ \text{Depressed} \\ \text{Bored} \\ \text{Relieved} \end{bmatrix} \quad (2) $$
Figure 4. The correlations between the valence and arousal of defined eight emotions.
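For reference, the sign conventions described above can be written out as a small lookup table; this is an illustrative summary of Figure 4, not part of the algorithm itself.

```python
# (valence sign, arousal sign) for each defined emotion, following Figure 4.
EMOTION_QUADRANTS = {
    "Neutral":   (0, 0),      # both approximately zero
    "Happy":     (+1, +1),
    "Excited":   (+1, +1),
    "Fearful":   (-1, +1),
    "Angry":     (-1, +1),
    "Depressed": (-1, -1),
    "Bored":     (-1, -1),
    "Relieved":  (+1, -1),
}
```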
As mentioned in Section 2.2 and Section 2.3, various bio-physiological signals can be used to recognize human emotion, but we select a few representative signals in consideration of the vehicle environment and industrial demand. Although EEG is widely used, expensive medical equipment is required for its measurement. Other popular signals are cardiac signals (e.g., ECG or PPG). However, because noise caused by the movement of a subject is easily generated during the measurement of cardiac signals [17], they are not suitable for the driving environment. Therefore, among the many bio-physiological signals affected by emotional experience, we selected EDA for two reasons. The first is how strongly it relates to emotions. According to several related studies, EDA contains the most emotional information [54]. Healey and Picard [49] and Deng et al. [14] showed that EDA is the most representative signal for capturing driver stress in combination with other signals. The second is the ease of measurement while driving. Cardiac signals are usually measured with electrodes attached to the chest, and a driver cannot drive with electrodes on their chest all the time. Although EDA signals are also measured with electrodes, they can be measured on the palm of the hand; moreover, they can now be measured in dry conditions at a location that causes little interference while driving (e.g., the wrist) [60]. Hence, we propose the EDA signal ( E ) and the outputs of the FER ( V ^ , A ^ ) as the input features ( X ) to detect and classify the driver’s real emotion during driving.
Before the input features are fed into the deep learning architecture, the SFER model also requires preprocessing of the input features ( X ) . Cohn et al. [61] reported that individual differences in facial expressions are moderately strong, and Naveteur and Baque [62] reported individual differences in EDA. These individual differences are addressed by personalization through mean subtraction, performed separately for each individual: each individual’s mean value is subtracted from each feature to capture that individual’s baseline valence, arousal and EDA. Then, every input feature is scaled to 0–1 through normalization because the personalized input features have different scales. Each feature is normalized by subtracting its minimum value and dividing by the range between its maximum and minimum values, as shown in Equation (1). Personalization and normalization are possible because the input features ( X ) , especially the valence and arousal values ( V ^ , A ^ ) from the FER, rely on continuous representation.
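A minimal NumPy sketch of this per-driver preprocessing is given below; it assumes the driver’s accumulated valence, arousal and EDA samples are already available as 1-D arrays, and a small epsilon (not in the paper) guards against division by zero.

```python
import numpy as np

def personalize_and_normalize(values: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Mean-subtract one driver's feature series, then min-max scale it to [0, 1]."""
    centered = values - values.mean()                  # personalization: remove the driver's baseline
    v_min, v_max = centered.min(), centered.max()
    return (centered - v_min) / (v_max - v_min + eps)  # normalization per Equation (1)

# Applied independently to each input feature of one driver.
valence_n = personalize_and_normalize(np.array([0.1, 0.3, -0.2, 0.0]))
```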
We built a DNN, the most fundamental deep learning model, with multiple layers between input and output. Even though there are many advanced deep learning networks for solving difficult tasks, we already secured the FER feature vector ( V ^ , A ^ ) , extracted from a state-of-the-art deep learning model, and the EDA feature vector ( E ) , selected on the basis of related studies. Hence, given high-quality data for supervised learning, it is possible to classify the driver’s emotional state with only the DNN model through the deep learning training process. The DNN model consists of N hidden layers with a maximum of M units per layer. We set the number of final-layer units to k so that the model outputs a vector of size k representing one of the k defined emotions. The output vector passes through the softmax activation function to represent the confidence probability for each emotion. The DNN model was trained and tested on simulated data, as described in Section 5.2. Algorithm 2 summarizes the detailed procedure for recognizing the driver’s emotional state from the valence ( V ^ ) , arousal ( A ^ ) and EDA ( E ) values.
Algorithm 2 Our Sensor Fusion Emotion Recognition (SFER) Algorithm.
Require: Variable distinguishing the driver (Driver ID): ID
   List of defined emotional state categories for driver: C
   The recognized valence and arousal level of driver by facial expressions at time t: ( V ^ t , A ^ t )
   The measured EDA response of driver at time t: E t
Ensure: The recognized real emotion of driver at time t: Y ^ t
   Initialize:
      Let the sensor fusion emotion recognition model be SFER_Model
      Let SFER_Model be the DNN
      Let the final hidden layer of SFER_Model be a fully connected layer with k units and the softmax activation function
       $\hat{V}_{ID} = [\,],\ \hat{A}_{ID} = [\,],\ E_{ID} = [\,]$
   if Accumulated data for the driver with I D exists then
      Load V ^ I D , A ^ I D , E I D
   end if
   while Driving do
      # Accumulate data per individual for personalization
       $\hat{V}_{ID}.\mathrm{insert}(\hat{V}_t),\ \hat{A}_{ID}.\mathrm{insert}(\hat{A}_t),\ E_{ID}.\mathrm{insert}(E_t)$
      # Mean subtraction
       $(\hat{V}_t, \hat{A}_t, E_t) \mathrel{-}= (\mathrm{mean}(\hat{V}_{ID}), \mathrm{mean}(\hat{A}_{ID}), \mathrm{mean}(E_{ID}))$
      # Normalization
       $X_t^T = \left(\dfrac{\hat{V}_t - \min(\hat{V}_{ID})}{\max(\hat{V}_{ID}) - \min(\hat{V}_{ID})},\ \dfrac{\hat{A}_t - \min(\hat{A}_{ID})}{\max(\hat{A}_{ID}) - \min(\hat{A}_{ID})},\ \dfrac{E_t - \min(E_{ID})}{\max(E_{ID}) - \min(E_{ID})}\right)$
       $\hat{Y}_t^{\mathbb{1}} = \mathrm{SFER\_Model}(X_t)$
       $\hat{Y}_t = C[\operatorname{argmax}(\hat{Y}_t^{\mathbb{1}})]$
   end while

5. Experiments

This section presents the experimental evaluations of the proposed algorithms. We performed our experiments on a machine with Intel Core i9-9980XE CPU at 3.00 GHz, 125 GB RAM and Nvidia Titan RTX GPU. We applied the state-of-the-art CNN models from ILSVRC to our FER models and we used AffectNet [33] database to train and evaluate our FER models. The details are described in Section 5.1. Section 5.2 describes the experimental design of the human-in-the-loop simulation for collecting datasets. These datasets contain driver’s facial images, EDA measurements and induced emotions as the ground truth emotion labels. We used these data to train and evaluate the SFER models. More details are presented in Section 5.2.

5.1. Facial Expression Recognition

As discussed above, we used AffectNet [33], one of the largest facial expression databases, to train and evaluate our FER models. It contains approximately 1M facial images collected in the wild along with annotation information for FER. AffectNet covers a significant range of emotions on the faces of people of different races, ages and genders. Figure 5 shows sample images from AffectNet. Moreover, the biggest reason we used AffectNet is that it contains manually annotated intensities of valence and arousal. We only used the manually annotated images, comprising 320,739 training samples and 4500 validation samples, excluding uncertain images. Unfortunately, AffectNet has not yet released its test samples. Hence, we compare our results on the validation set with their baseline methods in Section 6.1.
Figure 5. Sample images in the AffectNet, including faces of people of different races, ages and gender.
We proposed various FER models for a comparative experiment. Every proposed FER model, evaluated in this study, consists of input shape ( 224 , 224 , 3 ) , several CNN layers and one fully connected layer as the trainable layers. Because we used RGB images, we set the depth of the input image to 3; however, the depth of the input image can be changed to 1 when using the binary images from the NIR camera to secure more robustness against changes in illuminance, as proposed by Gao et al. [5]. In the following, we distinguished the models by their base architectures and number of trainable layers. To validate the effectiveness according to the model’s depth, we designed the models with different depths using VGGNet [28] and ResNet [29]. To validate the performance of the parallelization and the channel-wise attention of CNN layers, we applied ResNeXt [30] and SE block [31] to our FER models. All of our FER models are described as follows, and detailed configurations of each structure are outlined in Table 2, one per column.
Table 2. The configurations of trainable layers for the proposed FER model.
  • VGG14 is based on VGGNet architectures and consists of 13 CNN layers. The last three fully connected layers are replaced with one fully connected layer with only two units, representing valence and arousal, respectively. The model has 14 trainable layers; thus, it is called VGG14.
  • VGG17 is based on VGGNet, and three more CNN layers are added to VGG14. It consists of 16 CNN layers and 1 fully connected layer.
  • ResNet18 is based on ResNet and has 18 trainable layers (17 CNN layers and 1 fully connected layer). Compared with VGG17, there is only one more CNN layer; however, ResNet18 has the shortcut connection for every two CNN layers, except the first CNN layer. The layers between shortcut connections are represented as curly brackets in Table 2.
  • ResNet34 is based on ResNet and has 34 trainable layers (33 CNN layers and 1 fully connected layer). It also has shortcut connections for every two CNN layers, except the first CNN layer. The layers between shortcut connections are represented as curly brackets in Table 2.
  • ResNet50 is based on ResNet and has 50 trainable layers (49 CNN layers and 1 fully connected layer). It has a shortcut connection for every three CNN layers, except the first CNN layer. The layers between shortcut connections are represented as curly brackets in Table 2.
  • ResNet101 is based on ResNet and has 101 trainable layers (100 CNN layers and 1 fully connected layer). It also has a shortcut connection for every three CNN layers, except the first CNN layer. The layers between shortcut connections are represented as curly brackets in Table 2.
  • ResNeXt34 is based on ResNeXt, and it is composed of 34 trainable layers (33 CNN layers and 1 fully connected layer). The cardinality is set to 32. The last CNN layers between shortcut connections are propagated by splitting them into 32 on a channel basis. In Table 2, the shortcut connections are represented as curly brackets, and the splitting operation is represented as every last layer in curly brackets.
  • SE-ResNet34 applies the SE block to ResNet34. The SE blocks are positioned between the last CNN layers of shortcut connections and the merge points of the shortcut connections, as shown in Table 2. The detailed structure is shown in Figure 3, and the reduction ratio r is set to 4.
  • SE-ResNeXt34 applies the SE block to ResNeXt34. The SE blocks are positioned between the last CNN layers of shortcut connections and the merge points of the shortcut connections, as shown in Table 2. The structure is the same as the SE block of SE-ResNet34, and the reduction ratio r is also set to 4.
We trained our FER models to minimize the distance between the predicted ( V ^ i , A ^ i ) and true ( V i , A i ) values of the valence and arousal using AffectNet [33]. L2 loss function measures the distance and is shown as follows:
$$ L(\hat{V}, \hat{A}, V, A) = \frac{1}{2n} \sum_{i=1}^{n} \left( (\hat{V}_i - V_i)^2 + (\hat{A}_i - A_i)^2 \right) $$
where n is the number of training samples, V ^ i is the predicted valence value of ith training sample, V i is the true valence value of ith training sample, A ^ i is the predicted arousal value of ith training sample, and A i is the true arousal value of ith training sample. We used the Adam algorithm [63], a popular optimizer, to optimize the model parameters. We set the learning rate to 0.001 and the first and second moments to 0.9 and 0.999, respectively. We tried to train over 10 epochs (over 3,207,390 iterations), and the training was terminated when the loss value on the validation set was stable. To compare our models, we used root mean squared error (RMSE) on the validation set:
$$ RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2} $$
where m is the number of validation samples, y ^ i is the predicted value of ith validation sample and y i is the true value of ith validation sample. The RMSE values of valence and arousal are compared separately.
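For concreteness, the L2 training loss and validation RMSE defined above can be computed as follows; this is a NumPy sketch with hypothetical array names, not the training code itself.

```python
import numpy as np

def l2_loss(v_pred, a_pred, v_true, a_true):
    """Halved mean squared error over valence and arousal, as in the L2 loss above."""
    return 0.5 * np.mean((v_pred - v_true) ** 2 + (a_pred - a_true) ** 2)

def rmse(y_pred, y_true):
    """Root mean squared error, computed separately for valence and for arousal."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```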

5.2. Sensor Fusion Emotion Recognition

Thirteen volunteers (six men and seven women) participated in this study, each participating five times. The experiment required inducing eight emotions in all participants; hence, it was impossible to conduct all experiments in a single day. We grouped two similar emotions per session (2 emotions/session × 4 sessions = 8 emotions), and one additional session was conducted as a pretest. All experiments were conducted after obtaining approval from Kookmin University’s IRB (KMU-202005-HR-235).
To study the eight emotions defined in Section 4.2, we needed to induce each emotional state in the participants. We applied a technique that combines film watching and passage writing, as shown in Figure 6. Each target emotion was induced with a 4–6 min video. Before the experiment, the researchers asked 70 people who were not familiar with our study to watch each video online and gathered their opinions about their emotional state after viewing; only after confirming that a video induced the intended emotion was it used in the experiment. To increase the duration of the induced emotions and reinforce them, we then asked the participants to freely describe, for 13 min, their own experiences related to the induced emotion. Video viewing and self-experience description are two of the most valid emotion induction and reinforcement techniques [53]. During video viewing and self-experience description, we recorded the driver’s facial images and measured his/her EDA. Afterwards, the participants’ self-reported emotions were collected through a survey. Then, driving was carried out for 5 min in the driving simulator, during which we again recorded the driver’s face images and measured his/her EDA. After finishing the driving, the experimenters debriefed the participants on the purpose of the study; in other words, by neutralizing the participants’ emotions, we made sure that their moods were close to the baseline level when they left the laboratory.
Figure 6. (a) The participants’ emotions are induced through video viewing. (b) The participants describe their own experiences related to the emotions induced.
In the experiment, we used a full-scale driving simulator with a six-DOF motion base equipped with AV Simulation’s SCANeR Studio 1.7 (AVSimulation, Boulogne-Billancourt, France, https://www.avsimulation.com/ (accessed on 18 March 2021)). An LF Sonata, a Hyundai midsize sedan, was used as the cabin. Three-channel projectors and three 2080 mm × 1600 mm screens were connected horizontally to visualize the driving scene. The participants’ physiological EDA signal was collected using a BioPac bioinstrument (BIOPAC Systems, Inc., Goleta, CA, USA, https://www.biopac.com/ (accessed on 18 March 2021)). The bioinstrument guarantees a signal-to-noise ratio (SNR) in excess of 70 dB. To acquire a reliable EDA signal, we removed the dead skin cells on the hand to prevent the interruption of signal collection and applied an isotonic electrode paste to the electrode to increase accuracy. In addition, before starting each experiment, we observed the EDA waveform to confirm that there was no visible noise throughout the signal. For the driver’s face images, we used a BRIO 4K camera (Logitech, Lausanne, Switzerland, https://www.logitech.com/, accessed on 18 March 2021) at 720 × 720 pixels and 30 fps during video viewing, self-experience description and driving in the simulator. Figure 7a shows the full-scale driving simulator. Figure 7b shows the installation of the BioPac bioinstrument and the camera in the driving simulator while the driver is driving. The camera was installed between the windshield and the headliner in front of the sun visor to prevent the driver’s face from being partially occluded by the steering wheel or hands.
Figure 7. (a) Three-channel projectors and screens and the cabin of the full-scale driving simulator. (b) The camera is installed between the windshield and the headliner (red) and the biomedical instrument is set on the driver’s wrist for EDA (green).
Among the data collected through the simulation, the video and EDA data collected while driving were used for training and evaluating the SFER model, and the rest of the data were used for reference. The driving data were acquired while each volunteer drove for about 5 min per emotion. The driver’s facial images were acquired at a 30 Hz sampling rate, and the EDA data were acquired at a 100 Hz sampling rate. The acquired driving data cannot be used for training the FER models: training the FER models requires true valence and arousal values for each facial image, whereas the driving data carry only the induced emotion as the ground truth label.
To validate the effectiveness of the input features and model structure, we propose various SFER models. Each proposed SFER model evaluated in this study consists of fully connected layers as the input layer, output layer and multiple hidden layers. Every output vector from the hidden layers passes through the ReLU activation function, and the output vector from the output layer passes through the softmax activation function. In the proposed SFER models, we set the number of output layer units to 8 because the number of emotional state categories for the driver ( k ) is defined as eight, as described in Section 4.2. The models are distinguished by their input features, number of hidden layers ( L ) and maximum number of units ( U ) per hidden layer. In the following, the prefix of the model name indicates the input features. If the prefix is VA or E, the model uses only the output values of the FER model ( V ^ , A ^ ) or only the EDA ( E ) value as input. If the model name starts with VAE, both the output values of the FER model ( V ^ , A ^ ) and the EDA ( E ) value are used as inputs. The numbers in parentheses in the model name give the number of hidden layers and the maximum number of units as ( L , U ) . All the proposed SFER models are described as follows; Table 3 presents their detailed configurations, and a code sketch of one such configuration is given after the list.
Table 3. Configurations of the proposed SFER models.
  • E ( 3 , 64 ) : This only uses EDA ( E ) value measured as the input value. The number of input layer units is 1, and this SFER model recognizes the driver’s emotional state with only bio-physiological information. The number of hidden layers is 3 and the number of maximum units is 64.
  • E ( 8 , 512 ) : This only uses EDA ( E ) value measured equal to E ( 3 , 64 ) . However, the number of layers and maximum units of the model are made deeper and wider than E ( 3 , 64 ) . The number of hidden layers is 8 and the number of maximum units is 512.
  • VA ( 3 , 64 ) : This only uses valence ( V ^ ) and arousal ( A ^ ) values from the FER model as input values. The number of input layer units is 2, and this SFER model recognizes the driver’s emotional state with only the FER information. The number of hidden layers is 3 and the number of maximum units is 64.
  • VA ( 8 , 512 ) : This only uses valence ( V ^ ) and arousal ( A ^ ) values equal to VA ( 3 , 64 ) . However, the number of layers and maximum units of the model are made deeper and wider than VA ( 3 , 64 ) . The number of hidden layers is 8 and the number of maximum units is 512.
  • VAE ( 3 , 64 ) : This uses valence ( V ^ ) , arousal ( A ^ ) and EDA ( E ) values as input values. The number of input layer units is 3, and this SFER model recognizes the driver’s emotional state with both FER and bio-physiological information. The number of hidden layers is 3 and the number of maximum units is 64.
  • VAE ( 8 , 512 ) : This uses valence ( V ^ ) , arousal ( A ^ ) and EDA ( E ) values equal to VAE ( 3 , 64 ) . However, the number of layers and maximum units of the model are made deeper and wider than VAE ( 3 , 64 ) . The number of hidden layers is 8 and the number of maximum units is 512.
  • VAE ( 9 , 1024 ) : This uses valence ( V ^ ) , arousal ( A ^ ) , and EDA ( E ) values as input values. The number of hidden layers is 9 and the number of maximum units is 1024.
  • VAE ( 9 , 2048 ) : This uses valence ( V ^ ) , arousal ( A ^ ) and EDA ( E ) values as input values. The number of hidden layers is 9 and the number of maximum units is 2048.
  • VAE ( 10 , 1024 ) : This uses valence ( V ^ ) , arousal ( A ^ ) and EDA ( E ) values as input values. The number of hidden layers is 10 and the number of maximum units is 1024.
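As referenced above, the sketch below shows one way to build such a model in PyTorch for the VAE-type input (valence, arousal, EDA); the per-layer widths are illustrative placeholders, since the exact configurations are given in Table 3.

```python
import torch
import torch.nn as nn

def build_sfer_model(n_inputs: int = 3,
                     hidden=(64, 128, 256, 512, 1024, 512, 256, 128, 64),
                     k: int = 8):
    """DNN with ReLU hidden layers and a k-way softmax output (illustrative widths)."""
    layers, in_dim = [], n_inputs
    for units in hidden:                                   # hidden layers with ReLU activations
        layers += [nn.Linear(in_dim, units), nn.ReLU()]
        in_dim = units
    layers += [nn.Linear(in_dim, k), nn.Softmax(dim=-1)]   # confidence per emotion category
    return nn.Sequential(*layers)

# Usage: one 10 Hz sample of personalized (valence, arousal, EDA) features.
model = build_sfer_model()
probs = model(torch.rand(1, 3))                            # shape (1, 8): one probability per emotion
```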
The minimum vehicle control cycle required by the industry is 10 Hz. To satisfy this requirement, all the proposed SFER models have a 10 Hz recognition frequency. The time window of the input data was set to 0.1 s because the driver’s emotions can change within the same period as the vehicle control state, which is updated at a rate of at least 10 Hz. Hence, to train and evaluate the proposed SFER models, filtering was required for each input signal, V ^ , A ^ and E. The output of the FER model ( V ^ , A ^ ) , which provides one valence and one arousal value per input image, has a sampling rate of 30 Hz; therefore, the filtered value was calculated every 0.1 s as the average of the three valence and arousal values from the previous 0.1 s. Because the EDA data were acquired at a 100 Hz sampling rate, the filtered EDA value ( E ) was calculated every 0.1 s as the average of the ten EDA values from the previous 0.1 s. The average filtering reduces the fine residual noise remaining in the EDA waveform, as shown in Figure 8.
Figure 8. Raw and average-filtered EDA electrical conductance measured during part of one simulation drive.
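The 10 Hz average filtering described above amounts to a block mean over each 0.1 s window; the following is a NumPy sketch under that assumption.

```python
import numpy as np

def average_filter(signal: np.ndarray, samples_per_window: int) -> np.ndarray:
    """Downsample a signal to 10 Hz by averaging non-overlapping 0.1 s windows."""
    n_windows = len(signal) // samples_per_window
    trimmed = signal[: n_windows * samples_per_window]
    return trimmed.reshape(n_windows, samples_per_window).mean(axis=1)

# 30 Hz valence/arousal -> 3 samples per 0.1 s window; 100 Hz EDA -> 10 samples per window.
valence_10hz = average_filter(np.random.rand(300), samples_per_window=3)
eda_10hz = average_filter(np.random.rand(1000), samples_per_window=10)
```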
Through the average filtering with a time window of 0.1 s, the total number of input data points ( V ^ , A ^ , E ) , each consisting of one valence, one arousal and one EDA value, is 310,389. The data were divided into training and test sets at an 8:2 ratio, and 20% of the training set was used for validation to prevent overfitting. Hence, we trained the SFER model with training and validation sets containing 198,610 and 49,653 samples, respectively, and evaluated it with a test set containing 62,066 samples.
The proposed SFER models require all inputs and outputs to be numeric because they operate through a series of numerical operations from input to output. This means that the driver’s defined emotional state categories ( C ) must be converted to a numerical form, and the SFER models’ output value ( Y ^ 𝕝 ) needs to be converted back into the categories ( C ) . One-hot encoding, the most widespread approach for this conversion, creates a separate binary column for each possible category and inserts a 1 into the corresponding column. The converted categories ( C 𝕝 ) are as follows:
$$ C_1^{\mathbb{1}T} = [1, 0, 0, 0, 0, 0, 0, 0], \quad C_2^{\mathbb{1}T} = [0, 1, 0, 0, 0, 0, 0, 0], \quad C_3^{\mathbb{1}T} = [0, 0, 1, 0, 0, 0, 0, 0], \quad \ldots, \quad C_8^{\mathbb{1}T} = [0, 0, 0, 0, 0, 0, 0, 1] $$
where k = 8 , as defined in Section 4.2. Then, induced emotion to the driver ( Y ) composed of C is also converted into Y 𝕝 , comprised of C 𝕝 . Hence, we can find the numerical cross-entropy loss between the induced emotion ( Y 𝕝 ) and predicted emotion ( Y ^ 𝕝 ) :
$$ L(\hat{Y}^{\mathbb{1}}, Y^{\mathbb{1}}) = -\frac{1}{n} \sum_{i=1}^{n} Y_i^{\mathbb{1}T} \log(\hat{Y}_i^{\mathbb{1}}) $$
where n is the number of training samples, Y ^ i 𝕝 is the one-hot encoded predicted emotion of the ith training sample and Y i 𝕝 is the one-hot encoded induced emotion of the ith training sample. The sum of all elements of Y ^ i 𝕝 is 1 because the output vector passes through the softmax activation function. We used the Adam algorithm [63], as in the FER model training, to optimize the parameters of our proposed SFER models. We set the learning rate to 0.001 and the first and second moments to 0.9 and 0.999, respectively. We trained over 30 epochs (over 5,958,300 iterations), and the accuracy of each model was evaluated once the loss value on the validation set reached a stable point. For the trained model to output the recognized emotional state, we take the index with the largest value of Y ^ 𝕝 and convert it to the emotional category with the corresponding index in C. Through this conversion, Y ^ 𝕝 is converted back into the recognized real emotional state ( Y ^ ), comprised of C. We compared the SFER models’ accuracy as the ratio of samples for which Y ^ matches Y.
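A brief NumPy illustration of the conversions described above (one-hot encoding, cross-entropy against the softmax output, and the argmax mapping back to a category); the variable names are hypothetical.

```python
import numpy as np

CATEGORIES = ["Neutral", "Happy", "Excited", "Fearful", "Angry", "Depressed", "Bored", "Relieved"]

def one_hot(index: int, k: int = 8) -> np.ndarray:
    """Binary vector with a 1 at the category index."""
    v = np.zeros(k)
    v[index] = 1.0
    return v

def cross_entropy(y_true_1hot: np.ndarray, y_pred_softmax: np.ndarray, eps: float = 1e-12) -> float:
    """Cross-entropy between an induced emotion and the model's softmax output."""
    return float(-np.sum(y_true_1hot * np.log(y_pred_softmax + eps)))

# Convert the softmax output back to a recognized emotion category.
y_pred = np.array([0.02, 0.80, 0.05, 0.03, 0.02, 0.03, 0.03, 0.02])
recognized = CATEGORIES[int(np.argmax(y_pred))]   # "Happy"
loss = cross_entropy(one_hot(1), y_pred)
```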

6. Results

In this section, we present the experiment results of the proposed algorithms. In Section 6.1, the results of training and comparison of the proposed FER models are described. The results of training and comparison of the proposed SFER models are described in Section 6.2. After analyzing the results, the DRER algorithm that we finally proposed is constructed by combining the FER model and SFER model with the best performance. The performance of DRER algorithm when compared with the state-of-the-art algorithms is also presented in Section 6.2.

6.1. Facial Expression Recognition

Figure 9 shows the value of validation loss over the 3M iterations for all our proposed FER models. Some of the training was stopped early if the validation loss had plateaued.
Figure 9. L2 loss function on the validation set of the proposed FER model training.
To validate the effect of network depth, we compared the RMSE values of valence and arousal on the validation set between the models based on VGG and ResNet, as shown in Table 4. Comparing VGG14 and VGG17, we expected VGG17 to obtain lower RMSE values because it is a deeper network than VGG14. However, in Table 4, VGG17 has larger RMSE values for both valence and arousal than VGG14. This degradation has already been reported in several studies. We expected the RMSE values to improve in the ResNet-based models because ResNet is the architecture designed to overcome this limitation. As expected, ResNet34’s RMSE values of valence and arousal are 0.418 and 0.378, respectively, which are much lower than those of VGG14. However, looking at the results of ResNet50 and ResNet101, which are deeper networks than ResNet34, the degradation problem is not completely resolved. Even though the models are based on ResNet, the RMSE values increase when the number of trainable layers exceeds 34. To validate the parallelization performance, we compared ResNeXt34 with ResNet34, which showed the best performance among the proposed models with different depths.
Table 4. RMSE values of valence and arousal on the validation set of the proposed FER.
Table 4 shows that ResNeXt34’s RMSE values of valence and arousal are 0.146 and 0.372, respectively. Increasing the number of layers did not lower the RMSE values; however, better recognition performance was obtained by splitting some of the CNN layers to make them parallel. To validate the channel-wise attention performance, we also compared ResNet with SE-ResNet and ResNeXt with SE-ResNeXt. The RMSE values of SE-ResNet34 for valence and arousal are 0.419 and 0.377, respectively, as shown in Table 4. The SE-ResNet34 result shows that the RMSE value of arousal decreased compared with that of ResNet34, but the RMSE value of valence increased. From the SE-ResNet34 result alone, we cannot say that SENet is effective. However, the SE-ResNeXt34 result shows a decrease in the RMSE value of valence from ResNeXt34 with only a minute change in the RMSE value of arousal. Hence, SENet can be considered effective for FER. SE-ResNeXt34 shows the best performance (0.408 and 0.373 for valence and arousal, respectively) in recognizing facial expressions as valence and arousal states among all the proposed models. This is an equivalent result compared with the baseline method proposed by Mollahosseini et al. [33], in which the RMSE values of valence and arousal are 0.37 and 0.41, respectively. Although the performance for valence and arousal tends to be opposite between the two methods, they are on the same level overall. The baseline method uses two separate models that output the valence and arousal states and requires the input face image to be cropped, whereas our proposed models output the valence and arousal states at once without cropping. Thus, the proposed SE-ResNeXt34 is a better FER model than the baseline method.

6.2. Sensor Fusion Emotion Recognition

Figure 10 shows the validation loss over the 30 epochs for all our proposed SFER models. We assessed the benefit of fusing bio-physiological signals by comparing VA ( 3 , 64 ) , E ( 3 , 64 ) and VAE ( 3 , 64 ) . The validation loss of VAE ( 3 , 64 ) is lower than those of VA ( 3 , 64 ) and E ( 3 , 64 ) in every epoch.
Figure 10. Cross-entropy loss on validation set of the proposed SFER model training.
Table 5 presents the accuracy on the test set. In the table, the accuracy of VAE ( 3 , 64 ) is 75% higher than that of VA ( 3 , 64 ) and 99% higher than that of E ( 3 , 64 ) . The same comparison holds for the deeper and wider models: the accuracy of VAE ( 8 , 512 ) is 114% higher than that of VA ( 8 , 512 ) and 146% higher than that of E ( 8 , 512 ) . A more interesting point is the rate of increase in accuracy for the model with fused inputs as the model structure becomes more complex. As the model structure using only bio-physiological information became deeper and wider, from E ( 3 , 64 ) to E ( 8 , 512 ) , the accuracy increased by 8%. The accuracy of VA ( 8 , 512 ) using only facial information increased by 9% over VA ( 3 , 64 ) . The accuracy of VAE ( 8 , 512 ) , fusing facial and bio-physiological information, increased by 34% over VAE ( 3 , 64 ) . On the basis of these results, we demonstrated the effectiveness of fusing the recognized facial expressions and the driver’s measured EDA information to recognize the driver’s real emotion.
Table 5. The accuracy of the SFER model on test set.
To find the model with the best accuracy while fusing both types of information, we compared the results for various structures of VAE models, as shown in Table 5. As the number of layers increased to 9 and the maximum number of units increased to 1024, the accuracy improved continuously; the accuracy of VAE ( 9 , 1024 ) is 0.886. However, beyond VAE ( 9 , 1024 ) , the accuracy of VAE ( 9 , 2048 ) , which has twice the maximum number of units of VAE ( 9 , 1024 ) , is 0.865. Similarly, the accuracy of VAE ( 10 , 1024 ) , which has one more layer than VAE ( 9 , 1024 ) , is 0.871. Both models have lower accuracy than VAE ( 9 , 1024 ) . Hence, the accuracy does not continue to increase as the model gets deeper and wider. The proposed SFER model showed the highest accuracy of 0.886 with VAE ( 9 , 1024 ) .
Table 6 is the confusion matrix of the evaluation result of VAE ( 9 , 1024 ) , which achieves the best accuracy among the proposed SFER models. It shows the recognition rate for each induced emotion. The highest recognition rate is 0.930 for the “happy” emotion, and the lowest is 0.861 for the “depressed” emotion. Every recognition rate lies between 0.861 and 0.930. Thus, the proposed SFER model recognizes all emotions evenly, without bias toward any particular emotion.
Table 6. Confusion matrix of the evaluation result of VAE ( 9 , 1024 ) .
The receiver operating characteristic (ROC) curve is a plot of the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis. Figure 11 presents the ROC curves using the one-versus-rest method for each emotion on the test set. It comprises eight graphs, each including the VAE ( 9 , 1024 ) , VAE ( 8 , 512 ) , VA ( 8 , 512 ) and E ( 8 , 512 ) ROC curves to compare the accuracy depending on the input features. There was little difference between the VAE ( 9 , 1024 ) and VAE ( 8 , 512 ) ROC curves, both of which lie above those of VA ( 8 , 512 ) and E ( 8 , 512 ) . Although VA ( 8 , 512 ) lies above E ( 8 , 512 ) on the ROC curve for most emotions, for relieved and fearful there was little difference between VA ( 8 , 512 ) and E ( 8 , 512 ) .
Figure 11. ROC curve for each defined emotion.
Table 7 shows the area under the curve (AUC) of the ROC curves in Figure 11 for each emotion and model. The AUC of a classification model with good performance is close to 1, which indicates a good measure of separability. As shown in Table 7, the average AUCs of VAE ( 9 , 1024 ) and VAE ( 8 , 512 ) were 0.994, which is 20% and 30% higher than those of VA ( 8 , 512 ) and E ( 8 , 512 ) , respectively. VAE ( 9 , 1024 ) and VAE ( 8 , 512 ) had high AUCs for all emotions, whereas VA ( 8 , 512 ) had its lowest AUC of 0.790 for neutral and its highest AUC of 0.848 for depressed. E ( 8 , 512 ) had its lowest AUC of 0.705 for depressed and its highest AUC of 0.826 for relieved.
Table 7. AUC result of the proposed SFER models.
Based on the above experiment results, we identified the best FER model, SE-ResNeXt34, and the best SFER model, VAE ( 9 , 1024 ) . Therefore, we propose the DRER algorithm, which recognizes the driver’s real emotion while driving by combining SE-ResNeXt34 and VAE ( 9 , 1024 ) . The performance of the DRER algorithm in comparison with other state-of-the-art algorithms is shown in Table 8. A good emotion recognition model should have high accuracy and cover various emotional states. Although Machot et al. achieved high accuracy using only bio-physiological signals, they only classified four emotional states [22]. Comas et al. classified seven emotional states by fusing facial expressions and bio-physiological signals, but the accuracy was 64% [27]. On the other hand, the DRER algorithm achieved a high emotion recognition accuracy of 0.89 for eight different emotional states using facial expressions and EDA as the input features. The proposed DRER algorithm shows the highest accuracy while classifying the most emotional states.
Table 8. State-of-the-art algorithms for emotion recognition.

7. Conclusions and Future Work

In this paper, we propose the DRER algorithm, based on deep learning, to recognize the driver’s emotional state, which does not always appear clearly on the face while driving. On the basis of a CNN, we propose the FER model, an end-to-end architecture, to recognize the driver’s facial expression state from the driver’s face image without any additional processing. Then, on the basis of a DNN, we propose the SFER model to recognize the driver’s real emotional state, which is not fully revealed on the face, by fusing the recognized facial expression state and the driver’s EDA signals. We define appropriate driver emotion categories, and the output of the proposed SFER model is represented by the defined categories. We trained and evaluated our SFER model using the data collected from a human-in-the-loop simulation with a full-scale driving simulator. We constructed the DRER model by combining the best-performing FER model and SFER model from our experiments. As a result, our proposed DRER model achieved 88.6% accuracy in recognizing the driver’s induced emotions while driving, using only the face image and EDA signal. Therefore, our DRER model is expected to be more reliable than existing FER models and will be useful for services based on the driver’s emotional state.
There are multiple directions along which our proposed models could be made more robust in future work. The first is considering emotional continuity. Our DRER model recognizes the emotional state from only a short moment of information, but emotions are continuous in nature. Thus, if the proposed model considered this sequential aspect, its performance in recognizing the driver's emotion would likely improve significantly. Second, experiments using a real vehicle could considerably improve the robustness of the algorithm with respect to the external environment. Even though the full-scale driving simulator that we used closely resembles a real vehicle environment, its external environment is completely controlled. A future model trained with large amounts of data collected from real vehicles would likely show high accuracy in uncontrolled actual driving situations. Third, in this study, we did not consider real-time recognition. Optimizing the sampling frequency and time window through run-time analysis would likely make it possible to build a real-time system for monitoring the real emotions of drivers, as sketched below.
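As an illustration of the third direction, a real-time monitor could buffer the most recent EDA samples in a sliding window and run the recognition pipeline at a fixed rate. The sketch below shows only the structure of such a loop; the window length, sampling rate, inference period, and the `camera` and `eda_sensor` objects are hypothetical placeholders, and `recognize_emotion` stands in for the trained DRER pipeline.

```python
import collections
import time

EDA_RATE_HZ = 4          # assumed EDA sampling rate
WINDOW_SEC = 8           # assumed analysis window length
INFER_PERIOD_SEC = 1.0   # assumed inference period (once per second)

eda_window = collections.deque(maxlen=EDA_RATE_HZ * WINDOW_SEC)

def recognize_emotion(face_frame, eda_samples):
    """Placeholder for the DRER pipeline (FER backbone + sensor fusion head)."""
    return "neutral"  # a trained model would return one of the defined emotion categories

def monitor(camera, eda_sensor):
    """Stream EDA into a sliding window and run recognition at a fixed rate."""
    next_infer = time.monotonic()
    while True:
        eda_window.append(eda_sensor.read())      # hypothetical sensor interface
        if time.monotonic() >= next_infer and len(eda_window) == eda_window.maxlen:
            emotion = recognize_emotion(camera.capture(), list(eda_window))
            print("driver emotion:", emotion)
            next_infer += INFER_PERIOD_SEC
        time.sleep(1.0 / EDA_RATE_HZ)
```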

Author Contributions

Conceptualization, S.L. (Sejoon Lim) and S.L. (Sangho Lee); methodology, G.O. and J.R.; software, G.O., J.R. and E.J.; validation, J.H.Y., S.L. (Sejoon Lim) and S.L. (Sangho Lee); formal analysis, G.O.; investigation, G.O., J.H.Y. and S.L. (Sejoon Lim); resources, G.O., J.H.Y. and S.H.; data curation, G.O., J.R., E.J. and S.H.; writing—original draft preparation, G.O.; writing—review and editing, J.H.Y. and S.L. (Sejoon Lim); visualization, G.O. and J.R.; supervision, S.L. (Sejoon Lim); project administration, S.L. (Sejoon Lim) and S.L. (Sangho Lee); and funding acquisition, J.H.Y., S.L. (Sejoon Lim) and S.L. (Sangho Lee). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Hyundai Motor Group, the Knowledge Service Industry Core Technology Development Program funded by the Ministry of Trade, Industry, and Energy of Korea (No. 20003519), the Basic Science Research Program of the National Research Foundation of Korea funded by the Ministry of Science, ICT, and Future Planning (No. 2021R1A2C1005433), and the BK21 Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 5199990814084).

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Kookmin University (protocol code: KMU-202005-HR-235; date of approval: 28 October 2020).

Acknowledgments

The authors thank Seungjoon Lee, Youngdong Kwon, and Myeongkyu Lee for collecting and analyzing the survey data.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRER	Driver's Real Emotion Recognizer
FER	Facial Expression Recognition
SFER	Sensor Fusion Emotion Recognition
EEG	Electroencephalogram
ECG	Electrocardiogram
PPG	Photoplethysmography
EDA	Electrodermal Activity
CNN	Convolutional Neural Network
DNN	Deep Neural Network
ROC	Receiver Operating Characteristic
AUC	Area Under the Curve

References

  1. Underwood, G.; Chapman, P.; Wright, S.; Crundall, D. Anger while driving. Transp. Res. Part F Traffic Psychol. Behav. 1999, 2, 55–68. [Google Scholar] [CrossRef]
  2. Jeon, M. Don’t cry while you’re driving: Sad driving is as bad as angry driving. Int. J. Hum. Comput. Interact. 2016, 32, 777–790. [Google Scholar] [CrossRef]
  3. Kahou, S.E.; Bouthillier, X.; Lamblin, P.; Gulcehre, C.; Michalski, V.; Konda, K.; Jean, S.; Froumenty, P.; Dauphin, Y.; Boulanger-Lewandowski, N.; et al. Emonets: Multimodal deep learning approaches for emotion recognition in video. J. Multimodal User Interfaces 2016, 10, 99–111. [Google Scholar] [CrossRef]
  4. Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 445–450. [Google Scholar]
  5. Gao, H.; Yüce, A.; Thiran, J.P. Detecting emotional stress from facial expressions for driving safety. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 5961–5965. [Google Scholar]
  6. Chang, W.Y.; Hsu, S.H.; Chien, J.H. FATAUVA-Net: An integrated deep learning framework for facial attribute recognition, action unit detection, and valence-arousal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 17–25. [Google Scholar]
  7. Kollias, D.; Zafeiriou, S. A multi-task learning & generation framework: Valence-arousal, action units & primary expressions. arXiv 2018, arXiv:1811.07771. [Google Scholar]
  8. Theagarajan, R.; Bhanu, B.; Cruz, A. DeepDriver: Automated System For measuring Valence and Arousal in Car Driver Videos. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2546–2551. [Google Scholar]
  9. Ekman, P.; Friesen, W.V. Nonverbal leakage and clues to deception. Psychiatry 1969, 32, 88–106. [Google Scholar] [CrossRef] [PubMed]
  10. Ekman, P. Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage (Revised Edition); WW Norton & Company: New York, NY, USA, 2009. [Google Scholar]
  11. Porter, S.; Ten Brinke, L. Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions. Psychol. Sci. 2008, 19, 508–514. [Google Scholar] [CrossRef]
  12. Yan, W.J.; Wu, Q.; Liang, J.; Chen, Y.H.; Fu, X. How fast are the leaked facial expressions: The duration of micro-expressions. J. Nonverbal Behav. 2013, 37, 217–230. [Google Scholar] [CrossRef]
  13. Oh, Y.H.; See, J.; Le Ngo, A.C.; Phan, R.C.W.; Baskaran, V.M. A survey of automatic facial micro-expression analysis: Databases, methods, and challenges. Front. Psychol. 2018, 9, 1128. [Google Scholar] [CrossRef]
  14. Deng, Y.; Wu, Z.; Chu, C.H.; Zhang, Q.; Hsu, D.F. Sensor feature selection and combination for stress identification using combinatorial fusion. Int. J. Adv. Robot. Syst. 2013, 10, 306. [Google Scholar] [CrossRef]
  15. Ooi, J.S.K.; Ahmad, S.A.; Chong, Y.Z.; Ali, S.H.M.; Ai, G.; Wagatsuma, H. Driver emotion recognition framework based on electrodermal activity measurements during simulated driving conditions. In Proceedings of the 2016 IEEE EMBS Conference on Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, Malaysia, 4–7 December 2016; pp. 365–369. [Google Scholar]
  16. Zhong, B.; Qin, Z.; Yang, S.; Chen, J.; Mudrick, N.; Taub, M.; Azevedo, R.; Lobaton, E. Emotion recognition with facial expressions and physiological signals. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8. [Google Scholar]
  17. Dzedzickis, A.; Kaklauskas, A.; Bucinskas, V. Human emotion recognition: Review of sensors and methods. Sensors 2020, 20, 592. [Google Scholar] [CrossRef]
  18. Raheel, A.; Majid, M.; Alnowami, M.; Anwar, S.M. Physiological sensors based emotion recognition while experiencing tactile enhanced multimedia. Sensors 2020, 20, 4037. [Google Scholar] [CrossRef] [PubMed]
  19. Liu, S.; Wang, X.; Zhao, L.; Zhao, J.; Xin, Q.; Wang, S. Subject-independent Emotion Recognition of EEG Signals Based on Dynamic Empirical Convolutional Neural Network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020. [Google Scholar] [CrossRef] [PubMed]
  20. Chao, H.; Liu, Y. Emotion recognition from multi-channel EEG signals by exploiting the deep belief-conditional random field framework. IEEE Access 2020, 8, 33002–33012. [Google Scholar] [CrossRef]
  21. Zheng, S.; Peng, C.; Fang, F.; Liu, X. A Novel Fuzzy Rough Nearest Neighbors Emotion Recognition Approach Based on Multimodal Wearable Biosensor Network. J. Med. Imaging Health Inform. 2020, 10, 710–717. [Google Scholar] [CrossRef]
  22. Al Machot, F.; Elmachot, A.; Ali, M.; Al Machot, E.; Kyamakya, K. A deep-learning model for subject-independent human emotion recognition using electrodermal activity sensors. Sensors 2019, 19, 1659. [Google Scholar] [CrossRef]
  23. Santamaria-Granados, L.; Munoz-Organero, M.; Ramirez-Gonzalez, G.; Abdulhay, E.; Arunkumar, N. Using deep convolutional neural network for emotion detection on a physiological signals dataset (AMIGOS). IEEE Access 2018, 7, 57–67. [Google Scholar] [CrossRef]
  24. Rayatdoost, S.; Rudrauf, D.; Soleymani, M. Multimodal Gated Information Fusion for Emotion Recognition from EEG Signals and Facial Behaviors. In Proceedings of the 2020 International Conference on Multimodal Interaction, Utrecht, The Netherlands, 25–26 October 2020; pp. 655–659. [Google Scholar]
  25. Siddharth, S.; Jung, T.P.; Sejnowski, T.J. Utilizing deep learning towards multi-modal bio-sensing and vision-based affective computing. IEEE Trans. Affect. Comput. 2019. [Google Scholar] [CrossRef]
  26. Val-Calvo, M.; Álvarez-Sánchez, J.R.; Ferrández-Vicente, J.M.; Fernández, E. Affective Robot Story-Telling Human-Robot Interaction: Exploratory Real-Time Emotion Estimation Analysis Using Facial Expressions and Physiological Signals. IEEE Access 2020, 8, 134051–134066. [Google Scholar] [CrossRef]
  27. Comas, J.; Aspandi, D.; Binefa, X. End-to-end facial and physiological model for affective computing and applications. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 93–100. [Google Scholar]
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  30. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  31. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  32. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [Google Scholar] [CrossRef]
  33. Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
  34. Lucey, P.; Cohn, J.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition—Workshops (CVPRW 2010), San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar] [CrossRef]
  35. Kosti, R.; Alvarez, J.M.; Recasens, A.; Lapedriza, A. EMOTIC: Emotions in Context dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 61–69. [Google Scholar]
  36. Subramanian, R.; Wache, J.; Abadi, M.K.; Vieriu, R.L.; Winkler, S.; Sebe, N. ASCERTAIN: Emotion and Personality Recognition Using Commercial Sensors. IEEE Trans. Affect. Comput. 2018, 9, 147–160. [Google Scholar] [CrossRef]
  37. Soleymani, M.; Lichtenauer, J.; Pun, T.; Pantic, M. A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 2011, 3, 42–55. [Google Scholar] [CrossRef]
  38. Jeong, D.; Kim, B.G.; Dong, S.Y. Deep Joint Spatiotemporal Network (DJSTN) for Efficient Facial Expression Recognition. Sensors 2020, 20, 1936. [Google Scholar] [CrossRef] [PubMed]
  39. Riaz, M.N.; Shen, Y.; Sohail, M.; Guo, M. eXnet: An Efficient Approach for Emotion Recognition in the Wild. Sensors 2020, 20, 1087. [Google Scholar] [CrossRef] [PubMed]
  40. Kortelainen, J.; Tiinanen, S.; Huang, X.; Li, X.; Laukka, S.; Pietikäinen, M.; Seppänen, T. Multimodal emotion recognition by combining physiological signals and facial expressions: A preliminary study. In Proceedings of the 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Diego, CA, USA, 28 August–1 September 2012; pp. 5238–5241. [Google Scholar]
  41. Huang, X.; Kortelainen, J.; Zhao, G.; Li, X.; Moilanen, A.; Seppänen, T.; Pietikäinen, M. Multi-modal emotion analysis from facial expressions and electroencephalogram. Comput. Vis. Image Underst. 2016, 147, 114–124. [Google Scholar] [CrossRef]
  42. Huang, Y.; Yang, J.; Liu, S.; Pan, J. Combining facial expressions and electroencephalography to enhance emotion recognition. Future Internet 2019, 11, 105. [Google Scholar] [CrossRef]
  43. Soleymani, M.; Asghari-Esfeden, S.; Pantic, M.; Fu, Y. Continuous emotion detection using EEG signals and facial expressions. In Proceedings of the 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 14–18 July 2014; pp. 1–6. [Google Scholar]
  44. Katsigiannis, S.; Ramzan, N. DREAMER: A database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE J. Biomed. Health Inform. 2017, 22, 98–107. [Google Scholar] [CrossRef]
  45. Sharma, K.; Castellini, C.; van den Broek, E.L.; Albu-Schaeffer, A.; Schwenker, F. A dataset of continuous affect annotations and physiological signals for emotion analysis. Sci. Data 2019, 6, 1–13. [Google Scholar] [CrossRef]
  46. Angkititrakul, P.; Hansen, J.H.; Choi, S.; Creek, T.; Hayes, J.; Kim, J.; Kwak, D.; Noecker, L.T.; Phan, A. UTDrive: The smart vehicle project. In In-Vehicle Corpus and Signal Processing for Driver Behavior; Springer: Boston, MA, USA, 2009; pp. 55–67. [Google Scholar]
  47. Ma, Z.; Mahmoud, M.; Robinson, P.; Dias, E.; Skrypchuk, L. Automatic detection of a driver’s complex mental states. In Proceedings of the International Conference on Computational Science and Its Applications, Trieste, Italy, 3–6 July 2017; pp. 678–691. [Google Scholar]
  48. Kawaguchi, N.; Matsubara, S.; Takeda, K.; Itakura, F. CIAIR in-car speech corpus–influence of driving status–. IEICE Trans. Inf. Syst. 2005, 88, 578–582. [Google Scholar] [CrossRef]
  49. Healey, J.A.; Picard, R.W. Detecting stress during real-world driving tasks using physiological sensors. IEEE Trans. Intell. Transp. Syst. 2005, 6, 156–166. [Google Scholar] [CrossRef]
  50. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  51. Jeon, M.; Walker, B.N. What to detect? Analyzing factor structures of affect in driving contexts for an emotion detection and regulation system. In The 55th Annual Meeting of the Human Factors and Ergonomics Society; Human Factors and Ergonomics Society: Los Angeles, CA, USA, 2011; Volume 55, pp. 1889–1893. [Google Scholar]
  52. Jeon, M. (Ed.) Emotions in driving. In Emotions and Affect in Human Factors and Human-Computer Interaction; Academic Press: San Diago, CA, USA, 2017; pp. 437–474. [Google Scholar]
  53. Fakhrhosseini, S.M.; Jeon, M. Affect/emotion induction methods. In Emotions and Affect in Human Factors and Human-Computer Interaction; Jeon, M., Ed.; Academic Press: Amsterdam, The Netherlands, 2017; pp. 235–253. [Google Scholar]
  54. Zepf, S.; Hernandez, J.; Schmitt, A.; Minker, W.; Picard, R.W. Driver Emotion Recognition for Intelligent Vehicles: A Survey. ACM Comput. Surv. (CSUR) 2020, 53, 1–30. [Google Scholar] [CrossRef]
  55. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252. [Google Scholar] [CrossRef]
  56. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR09), Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  57. Roy, A.G.; Navab, N.; Wachinger, C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans. Med Imaging 2018, 38, 540–549. [Google Scholar] [CrossRef]
  58. Barrett, L.F.; Adolphs, R.; Marsella, S.; Martinez, A.M.; Pollak, S.D. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychol. Sci. Public Interest 2019, 20, 1–68. [Google Scholar] [CrossRef] [PubMed]
  59. Dittrich, M.; Zepf, S. Exploring the validity of methods to track emotions behind the wheel. In Proceedings of the International Conference on Persuasive Technology, Limassol, Cyprus, 9–11 April 2019; pp. 115–127. [Google Scholar]
  60. Garbarino, M.; Lai, M.; Bender, D.; Picard, R.W.; Tognetti, S. Empatica E3—A wearable wireless multi-sensor device for real-time computerized biofeedback and data acquisition. In Proceedings of the 2014 4th International Conference on Wireless Mobile Communication and Healthcare-Transforming Healthcare Through Innovations in Mobile and Wireless Technologies (MOBIHEALTH), Athens, Greece, 3–5 November 2014; pp. 39–42. [Google Scholar]
  61. Cohn, J.F.; Schmidt, K.; Gross, R.; Ekman, P. Individual differences in facial expression: Stability over time, relation to self-reported emotion, and ability to inform person identification. In Proceedings of the Fourth IEEE International Conference on Multimodal Interfaces, Pittsburgh, PA, USA, 16 October 2002; pp. 491–496. [Google Scholar]
  62. Naveteur, J.; Baque, E.F.I. Individual differences in electrodermal activity as a function of subjects’ anxiety. Personal. Individ. Differ. 1987, 8, 615–626. [Google Scholar] [CrossRef]
  63. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
