Driver Emotion and Fatigue State Detection Based on Time Series Fusion

Abstract: Studies have shown that driver fatigue and unpleasant emotions significantly increase driving risks. Detecting driver emotions and fatigue states and providing timely warnings can effectively reduce the incidence of traffic accidents. However, existing models rarely combine driver emotion and fatigue detection, and there is room to improve recognition accuracy. In this paper, we propose a non-invasive and efficient detection method for driver fatigue and emotional state which, to our knowledge, is the first to combine the two in driver state detection. First, the captured video image sequences are preprocessed, and Dlib (an open-source image processing library) is used to locate face regions and mark key points; second, facial features are extracted, and fatigue indicators such as the driver's eye closure rate (PERCLOS) and yawn frequency are calculated using a double-threshold method and fused mathematically; third, an improved lightweight RM-Xception convolutional neural network is introduced to identify the driver's emotional state; finally, the two indicators are fused on a time-series basis to obtain a comprehensive score for evaluating the driver's state. The results show that the proposed fatigue detection algorithm has high accuracy, and the emotion recognition network reaches an accuracy of 73.32% on the Fer2013 dataset. The composite score calculated from the time-series fusion can comprehensively and accurately reflect the driver's state in different environments and contributes to future research in the field of assisted safe driving.


Introduction
According to statistics, the number of deaths caused by traffic accidents has reached 1.35 million worldwide [1], and traffic accidents caused by fatigued driving account for approximately 20-30% of all traffic accidents [2]. Studies have shown that uncontrolled emotions are one of the primary factors that raise driving risks [3]: anger may cause road rage [4], and sadness and stress can reduce driver concentration [5]. Approximately 90% of traffic accidents could be avoided if drivers were warned before they occur [6]. Therefore, in order to reduce and avoid traffic accidents, it is important to identify driver fatigue and emotional states and to warn drivers with the help of assisted driving systems.
The existing driver fatigue detection methods can be divided into three main categories: those based on vehicle behavior [7], those based on physiological signals [8][9][10][11], and those based on visual features [12][13][14]. Vehicle-behavior methods judge the driver's fatigue indirectly, for example from whether the car crosses lane markings or follows another vehicle too closely. However, because real road conditions are complex and driving habits differ greatly between drivers, it is difficult to define a unified standard for fatigue, and the main drawback of these methods is their low accuracy. Common physiological signals used to detect fatigue include the electrocardiogram (ECG) and the electroencephalogram (EEG).
The remainder of the paper is organized as follows. Section 2 presents the algorithm design for driver emotion and fatigue detection and how the two are integrated. Section 3 reports experimental tests of the proposed algorithm. Section 4 discusses the main contributions of this paper and proposes future research directions.

Fatigue Detection Methods
For driver fatigue detection based on vehicle information, Li et al. detected driver fatigue from the driver's grip on the steering wheel, extracted fatigue features using the wavelet transform, and compared the performance of algorithms such as SVM and K-nearest neighbors in distinguishing driver status [19]. Zhang et al. proposed a driver fatigue detection method based on steering wheel angle features, built a detection model using support vector machines, and optimized the model parameters through cross-validation [20].
For driver fatigue detection based on physiological signals, Sobhan et al. proposed a fatigue detection model based on a deep convolutional neural network and long short-term memory network to extract fatigue features from six active regions and raw EEG data [10]. Chai et al. proposed an EEG-based binary fatigue detection method using autoregressive modeling for feature extraction and Bayesian neural networks for feature classification [11]. Lin et al. proposed a novel brain-computer interface system that detects human physiological states by acquiring EEG signals in real time; the system detects drowsiness in real time based on EEG and provides warning information to the user when needed [21]. For driver fatigue detection based on visual information, Zhu et al. designed a driver fatigue detection algorithm based on facial key points: a deep convolutional network detects the face region, and a driver fatigue assessment model is established by calculating the eye aspect ratio (EAR), mouth aspect ratio (MAR), and percentage of eye closure time (PERCLOS) from the facial key points [22]. He et al. proposed a fatigue detection method based on a cascade of two CNNs, used for facial feature detection and for eye and mouth state classification, respectively [23]. Bin et al. proposed a fatigue detection algorithm based on facial multi-feature fusion of blink and yawn frequencies [24]. Li et al. designed a fatigue driving detection algorithm based on facial multi-feature fusion, which introduces an improved YOLOv3 algorithm to capture facial regions and calculates the driver's eye closure time, blink frequency, and yawn frequency to assess the fatigue state through eye and mouth feature vectors [25]. Chen et al. proposed a fatigue detection model based on a BP neural network and a time-accumulation effect to reflect the accumulation of fatigue over time [15]. Yu et al. proposed a fatigue detection method based on 3D deep convolutional neural networks, in which multiple input frames generate a spatiotemporal representation that is combined with scene conditions to produce fused features for drowsiness detection [17]. Facial multi-feature fusion can greatly improve the accuracy and robustness of fatigue detection.

Emotion Recognition Methods
For emotion recognition based on physiological signals, Robert et al. proposed an EEG-based feature extraction method for emotion recognition, using a machine learning technique to filter and compare features on a dataset they created [26]. Joao et al. used facial electromyography for the detection of emotional states and proposed a framework that combines electromyographic detection of overall facial expressions with oculogram-based sweep detection to classify emotions into four categories: neutral, sad, angry, and happy [27].
To recognize people's mental states based on speech signals, Renato et al. proposed emotion-related audio features to advance music emotion recognition techniques; their paper proposes algorithms related to musical texture and expression techniques and creates a public dataset of 900 audio clips to evaluate them [28]. Han et al. used deep neural networks to generate probability distributions of emotional states for speech segments, constructed discourse-level features from these distributions, and then fed the features into an extreme learning machine to achieve speech emotion recognition [29].
For emotion recognition based on visual features, Mohan et al. designed a facial expression recognition model based on deep convolutional networks using a hierarchical fusion method of local feature classification and overall feature classification [12]. Minaee et al. proposed a facial expression recognition method using attentional convolutional networks [13]. Xiao et al. designed a facial expression-based driver emotion recognition network called FERDERnet, which mainly consists of three parts: a face detection module, a data enhancement and resampling module, and an emotion recognition module [14]. Li et al. proposed an emotion recognition method based on video sequences, using the visual information of facial expression sequences and the speech information of audio to fuse the judgment, and using convolutional neural networks to improve the recognition performance of facial expressions [18]. Kansizoglou et al. implemented continuous emotion recognition via recurrent neural networks for long-term behavior modeling [30]. The above emotion recognition methods, which build more complex models and require processing a large amount of image data, present new challenges to the computing power of computers.

Materials and Methods
This system collects and processes the driver's facial information, analyzes the driver's emotion and fatigue level in real time, and effectively monitors the driver and actively warns of accidents.

Image Pre-Processing
In this paper, we employ a Raspberry Pi with a CSI camera to capture video data. The acquired color image stream is converted to grayscale frame by frame; grayscale images reduce the impact of external factors such as illumination while preserving the image information, and their smaller data volume makes matrix operations easier to perform.
When environmental factors such as illumination strongly affect the image, the grayscale operation alone is of limited use, and the extracted information does not adequately reflect the emotional and fatigue information. In this paper, histogram equalization is therefore performed on the grayscale image. The local information in the processed image changes significantly: the darker parts become brighter while the brighter parts are not overexposed, which mitigates the impact of uneven illumination on feature extraction.
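The two pre-processing steps above (grayscale conversion and histogram equalization) can be sketched in pure Python for illustration; in the actual pipeline, OpenCV's `cv2.cvtColor` and `cv2.equalizeHist` would be applied to the captured frames. The luminance weights and the tiny example image below are standard illustrative values, not data from the paper.

```python
# Minimal sketch of the grayscale + histogram-equalization pre-processing step,
# implemented in pure Python for illustration.

def to_gray(pixel):
    """Luminance of one (R, G, B) pixel using the common ITU-R BT.601 weights."""
    r, g, b = pixel
    return int(0.299 * r + 0.587 * g + 0.114 * b)

def equalize(gray_image):
    """Histogram-equalize a 2-D list of 8-bit gray levels."""
    flat = [p for row in gray_image for p in row]
    n = len(flat)
    # Histogram and cumulative distribution of the 256 gray levels.
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    cdf, total = [0] * 256, 0
    for level in range(256):
        total += hist[level]
        cdf[level] = total
    cdf_min = next(c for c in cdf if c > 0)
    # Map each level so the output spreads over the full 0-255 range.
    lut = [round((cdf[level] - cdf_min) / max(n - cdf_min, 1) * 255)
           for level in range(256)]
    return [[lut[p] for p in row] for row in gray_image]

# A dark, low-contrast 2x4 patch is stretched to full contrast.
dark = [[10, 10, 12, 12], [14, 14, 16, 16]]
print(equalize(dark))   # [[0, 0, 85, 85], [170, 170, 255, 255]]
```

The lookup-table mapping is what makes the darker parts brighter without overexposing the already-bright parts, as described above.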


Face and Key Point Detection
Dlib is an open-source toolkit built on C++ that includes a Python development interface. Its model file shape_predictor_68_face_landmarks.dat is used for 68-point facial landmark detection. The library provides two important components for face feature extraction, a face detector and a facial key point predictor, which return the coordinates of the facial feature points, the face angle, and other parameters. The Dlib library detects and extracts the facial regions of interest in images quickly.
After pre-processing, the image is given to the face detector. After the face is successfully detected, a bounding box is applied to the image to extract the region of interest (ROI) for further analysis. If the face is not detected, the next frame will be processed. The extracted ROI will be fed into the face key point predictor to mark 68 key points of the face, as shown in Figure 2.


Head Posture Estimation
Head posture is one of the indispensable indicators for driver fatigue detection. When using the camera to estimate the driver's head posture, coordinate system conversion is required. Four coordinate systems are involved: the world coordinate system, the camera coordinate system, the image center coordinate system, and the pixel coordinate system. A 3D rigid body has two types of motion relative to the camera: translation and rotation. Translation comprises the degrees of freedom along the X, Y, and Z axes, while rotation is described by the Euler angles: horizontal roll, vertical pitch, and yaw. The essence of the driver's head posture estimation is to find these six parameters, as shown in Figure 3.

Suppose that a point P = (U, V, W) in the world coordinate system is known, and let the rotation matrix and translation vector be R and t, respectively. The position of P in the camera coordinate system, (X, Y, Z), can then be calculated as follows:

(X, Y, Z)^T = R (U, V, W)^T + t (1)
The transformation from the camera coordinate system to the pixel coordinate system can be calculated from Equation (2):

s (u, v, 1)^T = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]] (X, Y, Z)^T (2)

where f_x and f_y are the focal lengths in the x- and y-axis directions, respectively, (c_x, c_y) is the optical center, and s is the scale factor; for practical applications, the radial distortion parameters are omitted. The relationship between the pixel coordinate system and the world coordinate system follows by composing Equations (1) and (2). The resulting equation can be solved by DLT (Direct Linear Transform) and least squares, which yields the rotation and translation matrices, from which the Euler angles can be found.
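The two coordinate transformations above can be sketched directly in Python. The rotation matrix, translation vector, and intrinsic parameters below are illustrative example values, not calibration data from the paper.

```python
# Illustrative sketch of Equations (1)-(2): a world point P = (U, V, W) is
# rotated and translated into camera coordinates, then projected to pixel
# coordinates with the intrinsics fx, fy, cx, cy.

def world_to_camera(P, R, t):
    """(X, Y, Z)^T = R (U, V, W)^T + t   -- Equation (1)."""
    return tuple(sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3))

def camera_to_pixel(XYZ, fx, fy, cx, cy):
    """Pinhole projection: u = fx*X/Z + cx, v = fy*Y/Z + cy  -- Equation (2)."""
    X, Y, Z = XYZ
    return (fx * X / Z + cx, fy * Y / Z + cy)

# Identity rotation, small translation: the point sits 10 units ahead of the camera.
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = (0.0, 0.0, 10.0)
XYZ = world_to_camera((1.0, 2.0, 0.0), R, t)   # -> (1.0, 2.0, 10.0)
u, v = camera_to_pixel(XYZ, fx=800, fy=800, cx=320, cy=240)
print(u, v)   # 400.0 400.0
```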
OpenCV includes APIs for head pose estimation: solvePnP and solvePnPRansac. In this paper, solvePnP is used to solve the matrix equation.
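Once the rotation is recovered (solvePnP returns a rotation vector, which `cv2.Rodrigues` converts to a matrix), the Euler angles can be extracted with the standard decomposition below. This is a generic sketch of one common X-Y-Z convention, not necessarily the exact convention used in the paper.

```python
import math

# Extract (pitch, yaw, roll) in degrees from a 3x3 rotation matrix R,
# using a common X-Y-Z Euler decomposition.

def rotation_matrix_to_euler(R):
    sy = math.sqrt(R[0][0] ** 2 + R[1][0] ** 2)
    if sy > 1e-6:                       # non-degenerate case
        pitch = math.atan2(R[2][1], R[2][2])
        yaw = math.atan2(-R[2][0], sy)
        roll = math.atan2(R[1][0], R[0][0])
    else:                               # gimbal lock
        pitch = math.atan2(-R[1][2], R[1][1])
        yaw = math.atan2(-R[2][0], sy)
        roll = 0.0
    return tuple(math.degrees(a) for a in (pitch, yaw, roll))

# Identity rotation -> all angles are zero.
print(rotation_matrix_to_euler([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
```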

Eye and Mouth Aspect Ratio Definition
The blink information of the driver's eyes is one of the most important indicators of fatigue status. To determine whether the driver blinks, the eye aspect ratio (EAR) is calculated as the ratio of the Euclidean distances between the vertical landmark pairs (points 38-42 and 39-41) and the horizontal landmark pair (points 37-40) of the eye, which measures the degree of eye opening, as shown in Figure 4. Taking the left eye as an example, EAR is calculated as follows:

EAR = (||P38 − P42|| + ||P39 − P41||) / (2 ||P37 − P40||)

When the driver's eyes are open, EAR fluctuates around a particular value, maintaining a dynamic equilibrium, whereas when the eyes close, EAR drops rapidly. When EAR falls below a certain threshold, the eye is considered closed. The complex blink discrimination problem is thus reduced to calculating a Euclidean distance ratio over the eye feature points.
The mouth features can likewise serve as an important basis for fatigue discrimination. Similar to the definition of EAR, the mouth aspect ratio (MAR) is calculated from the vertical landmark pairs (points 51-59 and 53-57) and the horizontal landmark pair (points 49-55) of the mouth to measure the degree of mouth opening, as shown in Figure 5. The calculation formula is as follows:

MAR = (||P51 − P59|| + ||P53 − P57||) / (2 ||P49 − P55||) (7)
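The EAR and MAR ratios can be computed directly from the Dlib landmark coordinates. The sketch below uses the 1-based landmark numbers quoted in the text and assumes `pts` is a dict mapping landmark number to an (x, y) tuple; the synthetic eye shape is an illustrative example.

```python
import math

# Sketch of the EAR/MAR computations, using the Dlib 68-landmark indices
# quoted in the text (eye points 37-42, mouth points 49-59).

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(pts):
    """EAR = (|P38-P42| + |P39-P41|) / (2 |P37-P40|) for the left eye."""
    return (dist(pts[38], pts[42]) + dist(pts[39], pts[41])) / (2 * dist(pts[37], pts[40]))

def mouth_aspect_ratio(pts):
    """MAR = (|P51-P59| + |P53-P57|) / (2 |P49-P55|)."""
    return (dist(pts[51], pts[59]) + dist(pts[53], pts[57])) / (2 * dist(pts[49], pts[55]))

# Synthetic open-eye landmarks: 4 units wide, 1 unit tall.
open_eye = {37: (0, 0), 40: (4, 0),
            38: (1, 0.5), 42: (1, -0.5),
            39: (3, 0.5), 41: (3, -0.5)}
print(eye_aspect_ratio(open_eye))   # (1 + 1) / (2 * 4) = 0.25
```

Closing the eye collapses the vertical distances, so the ratio falls toward zero, which is exactly the threshold behavior the double-threshold method exploits.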

Double-Threshold Fatigue Index Calculation
Head posture, eye closure, and mouth opening are all important indicators for identifying driver fatigue [31][32][33]. In this paper, by identifying the states of the driver's head, eyes, and mouth, we use the double-threshold method to calculate four fatigue indicators, namely drowsy head-nod frequency, fatigue blink frequency, yawn frequency, and eye closure rate, and then determine the driver's fatigue level.

Head Indicator
The head rotation angle is also a significant indicator for fatigue discrimination. When the driver is tired, the head tends to nod or tilt, which mainly manifests as changes in the Pitch and Roll angles of the rotation vector, with little change in the Yaw direction. We can determine whether the driver has performed a head-nodding or head-tilting movement by comparing the change in the Pitch or Roll angle with a set threshold value.
In this paper, we use the pitch change to reflect changes in the driver's head posture. Because the amplitude of head movement while talking differs from that of drowsy nodding under fatigue, a threshold T_h is set to distinguish general head movements from drowsy nods. Since a drowsy nod also lasts for a period of time, a second threshold F_Hset is set; the double-threshold comparison method then determines whether the driver's head movement is a drowsy nod, as shown in Equation (10).
The driver's general head movement shifts the Pitch between 2° and 8°, while the Pitch amplitude is higher during drowsy head nodding, between 12° and 20°, as shown in Figure 6.
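The double-threshold idea for the head indicator can be sketched as follows: a frame counts as "head down" when the pitch change exceeds the angle threshold T_h, and an episode counts as a drowsy nod only if it persists for at least F_Hset consecutive frames. The threshold values below are illustrative assumptions chosen to separate the 2-8° (normal) and 12-20° (drowsy) ranges reported above; they are not tuned values from the paper.

```python
# Count drowsy nods in a pitch time series with the double-threshold method.

def count_drowsy_nods(pitch_series, t_h=10.0, f_hset=3):
    nods, run = 0, 0
    for pitch in pitch_series:
        if abs(pitch) > t_h:
            run += 1                 # still inside a head-down episode
        else:
            if run >= f_hset:        # episode long enough -> one drowsy nod
                nods += 1
            run = 0
    if run >= f_hset:                # episode may end with the series
        nods += 1
    return nods

# A brief 2-frame glance down is ignored; a sustained 4-frame nod counts.
pitches = [3, 15, 14, 3, 2, 16, 18, 17, 15, 4]
print(count_drowsy_nods(pitches))   # 1
```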

Fatigue Eye Closure Indicator
In this paper, a double-threshold comparison method is used to determine whether a driver's blink is a fatigue eye closure. Firstly, the average EAR value of the left and right eyes in the current video frame is computed and compared with the set threshold T_E, which determines whether the eyes are open or closed. Secondly, since the eyes remain closed for a longer period during fatigue, the number of frames F_EC with eyes closed is compared with the set threshold F_Eset, which determines whether the driver is closing the eyes due to fatigue; the discrimination procedure is the same as for yawning and drowsy nodding. The EAR values for open and closed eyes are shown in Figure 7, and the EAR values of the left and right eyes are calculated and averaged to give the driver's real-time EAR value.
F_EC ≥ F_Eset (12)

Figure 6. Magnitude of pitch variation in different cases.
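The fatigue eye-closure test above can be sketched as a short function: the per-frame EAR is the average over both eyes, a frame is "closed" when the average drops below T_E, and a closure counts as fatigue only when it lasts at least F_Eset consecutive frames. The threshold values are illustrative assumptions, not tuned values from the paper.

```python
# Double-threshold fatigue eye-closure test over per-frame EAR series.

def is_fatigue_closure(left_ears, right_ears, t_e=0.2, f_eset=3):
    closed_run = 0
    for le, re in zip(left_ears, right_ears):
        ear = (le + re) / 2          # driver's real-time EAR value
        closed_run = closed_run + 1 if ear < t_e else 0
        if closed_run >= f_eset:     # F_EC >= F_Eset -> fatigue eye closure
            return True
    return False

# A single-frame blink is not fatigue; a sustained 3-frame closure is.
blink = [0.3, 0.1, 0.3, 0.3]
sleepy = [0.3, 0.1, 0.1, 0.1]
print(is_fatigue_closure(blink, blink), is_fatigue_closure(sleepy, sleepy))  # False True
```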


Yawning Indicator
Since the mouth changes noticeably during yawning, talking, and eating, a threshold T_m is set to distinguish the change in mouth aspect ratio during yawning from that in other situations. Similar to the fatigue eye-closure judgment, the double-threshold comparison method is used to determine whether the driver is yawning, as shown in Equations (13) and (14), to prevent misjudgment.
After testing, the MAR value fluctuated around approximately 0.35 when the driver's mouth was closed, 0.5 when talking, and 0.73 when yawning, as shown in Figure 8.


Eye Closure Rate (PERCLOS)
The eye closure rate is the percentage of time per unit of time spent with the eyes closed. PERCLOS (percentage of eyelid closure over the pupil over time) is one of the most important indicators of driver fatigue. PERCLOS has three common measurement standards: P70, P80, and EM. According to previous studies, P80, defined as the proportion of time for which the eyelid covers more than 80% of the pupil, has the strongest correlation with the degree of fatigue. Provided that the frame rate of the camera is constant per unit of time, PERCLOS can be considered as the ratio of the number of closed-eye frames per unit of time, F_EC, to the total number of frames, F_Total. The calculation formula is as follows:

PERCLOS = F_EC / F_Total × 100%
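Under a fixed frame rate, the P80 computation reduces to a frame-count ratio, which a few lines of Python make concrete:

```python
# PERCLOS as the ratio of closed-eye frames F_EC to total frames F_Total.

def perclos(eye_closed_flags):
    """PERCLOS = F_EC / F_Total, as a percentage; flags are 1 (closed) / 0 (open)."""
    if not eye_closed_flags:
        return 0.0
    return 100.0 * sum(eye_closed_flags) / len(eye_closed_flags)

# 6 closed frames out of 30 -> 20% PERCLOS.
flags = [1] * 6 + [0] * 24
print(perclos(flags))   # 20.0
```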

Fatigue Recognition Algorithm with Multi-Feature Fusion
As described in the previous subsections, the Raspberry Pi can detect, in real time from the input video, indicators such as fatigue eye closure, yawning, drowsy nodding, and PERCLOS. With the double-threshold comparison method, the number of times N that the driver closes the eyes in fatigue, yawns, and nods drowsily within a unit time T can be counted.
The frequencies of fatigue eye closure, yawning, and drowsy nodding per unit time T are calculated as follows:

F_blink = N_blink / T, F_Yawn = N_Yawn / T, F_Nod = N_Nod / T

Fatigue detection requires the fusion of the four indicators. Given their different magnitudes, a comprehensive evaluation of the fatigue degree requires normalizing the data, which is done with the inverse tangent function, as shown in Equation (19). The normalized data are provided in Table 1.

Fatigue Indicators Value Normalization
Frequency of eye closure for fatigue F blink /(Times · S −1 ) F blink Yawning frequency F Yawn /(Times · S −1 ) F Yawn Sleepy nod frequency F Nod /(Times · S −1 ) F Nod PERCLOS P / % P According to the influence of each index on the driver fatigue degree set different weights, the comprehensive evaluation index F of the fatigue degree can be calculated as follows:
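The fusion step can be sketched as follows: each indicator is squashed toward [0, 1) with an arctangent map (one common form of the inverse-tangent normalization the text describes; the exact form of Equation (19) is an assumption here) and then combined in a weighted sum to give the composite fatigue index F. The weights are placeholders, not the paper's values.

```python
import math

# Arctangent normalization plus weighted fusion of the four fatigue indicators.

def atan_normalize(x):
    """Map a non-negative indicator to [0, 1) via y = (2/pi) * arctan(x)."""
    return 2.0 / math.pi * math.atan(x)

def fatigue_index(f_blink, f_yawn, f_nod, perclos, weights=(0.3, 0.2, 0.2, 0.3)):
    """Comprehensive fatigue index F as a weighted sum of normalized indicators."""
    values = [atan_normalize(v) for v in (f_blink, f_yawn, f_nod, perclos)]
    return sum(w * v for w, v in zip(weights, values))

# All indicators zero -> F = 0; large indicators push F toward 1.
print(fatigue_index(0, 0, 0, 0))   # 0.0
print(fatigue_index(5, 5, 5, 5))
```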

Convolutional Neural Network
The convolutional neural network evolved from the traditional multilayer neural network by adding convolutional and pooling layers as a feature extraction stage, which reduces the number of training parameters while effectively extracting feature information and reducing the complexity of the network. The fully connected layers are used to compute the loss and obtain the classification results. In this paper, the driver emotion recognition model is trained with an improved RM-Xception convolutional neural network.

Improved RM-Xception Emotion Recognition Algorithm
For the improved RM-Xception emotion recognition algorithm, we first select the activation function. We choose ReLU (rectified linear unit), the activation function most commonly used in neural networks, as shown in Equation (21). It requires no exponential operations and little computation, does not suffer from vanishing gradients, and effectively avoids the gradient saturation problem.
Next, the RM-Xception network is made lightweight: the network has 75,143 parameters in total, of which 73,687 are trainable. The overall structure of the network, shown in Figures 9 and 10, is divided into three parts: Entry flow, Middle flow, and Exit flow. First, the Entry flow applies a 3 × 3 convolution to the input face images, followed by ReLU activation and batch normalization; the ReLU function and batch normalization reduce data divergence and further enhance the nonlinear expressive capability of the model. Second, the Middle flow sends the convolved data through four depth-separable convolution modules with direct residual connections; each module performs three rounds of depth-separable convolution, activation, and batch normalization, with a 1 × 1 convolution on the direct residual connection. Finally, the Exit flow passes the output of the last module through a 1 × 1 convolution and a global mean pooling operation, and then to a SoftMax classifier that outputs seven emotions: angry, disgusted, scared, happy, sad, surprised, and neutral.

Time Series-Based Emotional Fatigue Feature Fusion Algorithm
When we judge the driver's emotion and fatigue state based on a single frame, there is generally a high degree of uncertainty. This can affect the precision of the system. Emotional and fatigue states are usually expressed as a process, and the driver's state cannot be determined by a single frame of facial information alone. Therefore, this paper constructs a driver state recognition method based on the fusion of emotional and fatigue features in time series, capturing the contextual information of the input video sequence to achieve accurate recognition of the driver condition.
In the field of emotion recognition, psychologists led by Ekman classified basic human emotions into six categories, namely happiness, anger, sadness, disgust, fear, and surprise; the neutral emotion was added later [34]. In this paper, emotion is calculated based on these seven categories. When the driver is tired, the emotion shown on the face is mostly neutral, while tension and fear distract the driver's attention, and anger may cause road rage and increase safety risk. Therefore, the two modalities of the driver's facial emotion and fatigue status are identified and fused, the driver's state level is classified according to the fused index score, and the driver is then warned in advance or actively intervened upon.
When the driver is in a happy mood, the assisted safe driving system does not need to intervene; when the driver expresses emotions such as anger or fear, the system needs to identify them and intervene in time. Therefore, according to the situations in which the assisted safety system needs to intervene and the impact of different emotions on the driver, the seven emotions are scored; the scoring table is shown in Table 2. Using this real-time emotion scoring table, the emotion score of each frame is obtained in real time, the cumulative score NScore within time T is calculated from the contextual information of the captured video sequence, and the emotion score per unit time, S_T, is then derived.
The emotion score is then fused with the composite fatigue index score to obtain an overall driver state score. According to this score, the driver's state is divided into four levels: suitable for driving, lower risk, higher risk, and unsuitable for driving, as indicated in Table 3.
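As a concrete illustration of the time-series fusion described above, the following sketch accumulates per-frame emotion scores into NScore over a window T, normalizes to S_T, fuses it with the fatigue composite index, and maps the result to the four driver-state levels. The per-emotion scores and fusion weights are placeholders, not the actual values of Table 2; the level thresholds follow those reported in the driver status detection experiment:

```python
# Placeholder per-frame emotion scores (the paper's Table 2 values differ):
# negative for calming emotions, positive for risk-raising ones.
EMOTION_SCORE = {
    "happy": -1.0, "neutral": 0.0, "surprised": 0.5,
    "sad": 1.0, "disgusted": 1.0, "scared": 1.5, "angry": 2.0,
}

def emotion_score_per_unit_time(frame_emotions, t_seconds):
    """Accumulate per-frame scores (NScore) and normalize over the window T."""
    n_score = sum(EMOTION_SCORE[e] for e in frame_emotions)
    return n_score / t_seconds  # S_T

def fuse(emotion_s_t, fatigue_index, w_emotion=0.5, w_fatigue=0.5):
    """Weighted fusion of the two modality scores (weights are assumed)."""
    return w_emotion * emotion_s_t + w_fatigue * fatigue_index

def driver_state_level(score):
    """Map the fused score to the four levels of Table 3."""
    if score < 0.01:
        return "suitable for driving"
    if score < 0.02:
        return "lower risk"
    if score < 0.03:
        return "higher risk"
    return "unsuitable for driving"
```

A window dominated by happy frames yields a negative S_T, which can offset a mildly elevated fatigue index, matching the behavior observed for test sequence 3 in the experiments.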

Experimental Platform and Dataset
In this experimental environment, the hardware configuration for neural network training is an Intel(R) Core(TM) i5-8300H CPU, an NVIDIA GeForce GTX 1050Ti GPU, and Windows 10. The hardware platform on which the system runs is a Raspberry Pi.
The dataset used for emotion recognition is FER2013, which was proposed in the Kaggle facial expression analysis competition. It contains 28,709 training samples covering seven emotions (angry, disgusted, scared, happy, sad, surprised, and neutral), plus 3859 validation and test samples. The human recognition rate on this database is roughly between 60% and 70%.
Since there is no publicly available dataset for fatigue detection, the YawDD dataset is chosen as the sample for experimental validation [35]. The YawDD dataset is a video dataset recorded by an in-car camera, capturing the driver's real-time states (chatting, silence, yawning, etc.) in the car under different lighting conditions, as shown in Figure 11.

Fatigue Detection Experiment
Video streams in YawDD are randomly selected to detect the fatigue characteristic indexes of the driver's eyes, mouth, and head, and the results are recorded in Tables 4 and 5. The driver's fatigue state is comprehensively evaluated according to the actual experimental data; the higher the fatigue composite index, the more tired the driver, as shown in Table 6.
Table 4. Accuracy of eye fatigue index detection.
To verify the accuracy of the proposed fatigue detection method, the states of the driver's eyes, mouth, and head were recorded on the Raspberry Pi for a specified number of frames, as shown in Figure 12. Figure 12a records the frames in which the driver's eyes are open or closed, where 1 represents eyes open and 0 represents eyes closed; Figure 12b records the duration of eye opening and closing with the same encoding. From Figure 12b, the longest continuous eye closure is 160 frames, approximately 10 s, during which the driver is severely fatigued and almost asleep. Figure 12c records the state of the driver's mouth, where 0 represents the mouth closed, 1 represents talking, and 2 represents yawning; Figure 12d records the driver's head posture, where 0 represents a normal posture, 1 represents a small head movement, and 2 represents the driver nodding off. After testing, the actual data curve and the recorded data curve almost overlap, indicating that the fatigue detection method is highly accurate.
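The binary eye-state sequences of Figure 12 lend themselves to a simple PERCLOS computation and a longest-closure check. The sketch below assumes a per-frame eye state (1 = open, 0 = closed) as in Figure 12a; the frame rate of 16 fps is implied by the 160-frame / 10 s closure above, and the windowing is an illustrative assumption:

```python
def perclos(eye_states, window=None):
    """Fraction of frames in the (optional trailing) window with eyes closed."""
    frames = eye_states if window is None else eye_states[-window:]
    closed = sum(1 for s in frames if s == 0)
    return closed / len(frames)

def longest_closure(eye_states, fps=16):
    """Longest continuous eye closure, in frames and in seconds."""
    longest = run = 0
    for s in eye_states:
        run = run + 1 if s == 0 else 0
        longest = max(longest, run)
    return longest, longest / fps

# 160 consecutive closed frames at ~16 fps is about 10 s, matching the
# severe-fatigue episode described for Figure 12b.
states = [1] * 40 + [0] * 160 + [1] * 40
frames, seconds = longest_closure(states, fps=16)
```

A sliding PERCLOS window, rather than whole-sequence statistics, is what lets the dual-threshold method react to fatigue onset in real time.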

Emotion Recognition Experiment
In this paper, we use the Fer2013 dataset to train the neural network model; its data volume is small. To strengthen the robustness of the trained network, data augmentation is performed on the Fer2013 dataset. Data augmentation means artificially flipping, cropping, and rotating the images; common methods include rotation, cropping, color jittering, and so on. In this paper, the range of random image rotation is set to 10 degrees and the range of random zoom to 0.1, without centering or normalization.
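A minimal sketch of such augmentation is shown below, using the zoom setting from the text; the flip probability is an assumption, and rotation (here up to 10 degrees) would normally be delegated to an image library such as Pillow rather than reimplemented:

```python
import numpy as np

def augment(img, rng, zoom_range=0.1, flip_prob=0.5):
    """Toy augmentation: random horizontal flip plus a random zoom-crop
    with scale drawn from (1 - zoom_range, 1]."""
    if rng.random() < flip_prob:
        img = img[:, ::-1]  # horizontal flip
    h, w = img.shape[:2]
    scale = 1.0 - rng.random() * zoom_range
    nh, nw = max(1, int(h * scale)), max(1, int(w * scale))
    top = rng.integers(0, h - nh + 1)
    left = rng.integers(0, w - nw + 1)
    return img[top:top + nh, left:left + nw]

rng = np.random.default_rng(0)
face = np.zeros((48, 48), dtype=np.uint8)  # Fer2013 images are 48 x 48
out = augment(face, rng)
```

Because the transformations are random, each epoch effectively sees a slightly different version of every training image, which is what improves robustness on a small dataset.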
In the process of training, the batch size is set to 64, the total number of training epochs to 200, and the number of classes to 7. The Adam optimization algorithm is selected as the optimizer to reduce the loss; it converges quickly and learns effectively. As the number of training iterations increases, the recognition accuracy of the improved RM-Xception network also increases; after 127 epochs, the accuracy reaches 73.32%, as shown in Figure 13. The iteration loss is displayed in Figure 14.
In the classification problem, Precision indicates the proportion of samples predicted as positive by the classifier that are actually positive, and Recall indicates the proportion of all positive samples that are correctly predicted as positive. They are calculated as follows:

Precision = TP/(TP + FP), Recall = TP/(TP + FN) (25)

where TP indicates that the sample is positive and the prediction is also positive, FP indicates that the sample is negative but the prediction is positive, and FN indicates that the sample is positive but the prediction is negative. After 127 rounds of iterative calculation, the method in this paper achieves a precision of 80.82% and a recall of 63.01%. Its performance is compared with other methods in Table 7.
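The two metrics follow directly from the confusion counts defined above (a minimal framework-independent sketch; the example counts are illustrative, not the experiment's actual confusion-matrix entries):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
```

A precision well above recall, as reported here (80.82% vs. 63.01%), means the model's positive predictions are usually right but it misses a sizable share of true positives.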

Driver Status Detection Experiment
To verify the merits of the time-series-based emotional fatigue feature fusion model, six video streams from the YawDD dataset were randomly selected for real-time driver state detection, as shown in Figure 15. The detection system calculates the driver's fatigue and emotion scores in each frame and, based on time-series accumulation, calculates the driver state score per unit time T according to the method described above. The system operates at a frame rate of approximately 4 fps on the Raspberry Pi 4B. Six video sequences were tested, each covering one minute of driver status data; the interface of the test system is presented in Figure 16, and the test data are presented in Table 8. From the experimental data, it can be seen that among the tested video sequences, two have a comprehensive driver state score of less than 0.01 and are suitable for driving; three have a score between 0.01 and 0.02, indicating a lower driving risk; and one has a score higher than 0.03 and is judged unsuitable for driving by the system. The predicted driver status matches the actual status, proving that the model can truly reflect the driver's state. The video sequence with test number 3 has a negative emotion score, indicating that the driver is in a happy state during that test, while the driver's eyes close to a certain extent while laughing. If only the fatigue state were identified, the system would likely misjudge this as a mild risk; combined with emotion recognition, however, the system predicts it as suitable for driving, which better expresses the driver's real driving state. The facial manifestations of driver fatigue and emotion are inextricably linked.
When some drivers are happy, eye closure increases, which raises the risk of system misjudgment; in a sad or neutral state, the corresponding fatigue indicators also increase. Combining the analysis of emotion and fatigue states expresses the driver's current state more accurately and increases the robustness of driver state identification. The integrated state indicator is not swayed by the level of any individual indicator and is more adaptable to complex driving environments.
This study realizes real-time monitoring of driver emotion and fatigue, which can effectively prevent safety accidents caused by driver fatigue and emotion fluctuations. The experimental data reflect the driver's state realistically and accurately, contributing to future research in the field of assisted safe driving.

Conclusions
This paper implements a time series-based driver fatigue and emotion recognition algorithm for accurate detection of the driver's real-time status, with some robustness in complex driving environments. The main findings are as follows: First, a dual-threshold, multi-feature-fusion fatigue recognition algorithm is proposed. By grayscaling and histogram-equalizing the face images captured by the micro camera, interference from the external environment is greatly reduced and the system's accuracy in determining driver fatigue is improved. The driver's head posture and eye and mouth states are then recognized, fatigue indicators are calculated, and the fatigue composite score obtained by mathematical fusion rapidly and accurately reflects the driver's fatigue level.
Second, the improved RM-Xception algorithm, which introduces a depthwise separable convolution module and a residual module, lightens the Xception architecture, significantly reduces the computational resources required to train the network, and yields a model with high emotion extraction capability. Meanwhile, training on the augmented images gives the network better robustness, and the model finally achieves 73.32% accuracy on the Fer2013 dataset. Calculating the driver's emotion indicator over a period of time based on the time series can truly reflect the driver's emotional state at a given moment, making a contribution to the future field of assisted safe driving.
In future work, this paper will test and improve the driving state recognition algorithm in more realistic scenarios and complex environments, consider combining driving data collected by multiple sensors, consider the applicability of the algorithm under different light intensities, and further investigate the relationship between a driver's facial state and his or her emotions when fatigued.