Deep Coupling Recurrent Auto-Encoder with Multi-Modal EEG and EOG for Vigilance Estimation

Vigilance estimation of drivers is an active research field in traffic safety. Wearable devices can monitor information about the driver's state in real time, which is then analyzed by a data analysis model to produce a vigilance estimate. The accuracy of the data analysis model directly determines the quality of the vigilance estimation. In this paper, we propose a deep coupling recurrent auto-encoder (DCRA) that combines electroencephalography (EEG) and electrooculography (EOG). This model uses a coupling layer to connect two single-modal auto-encoders and constructs a joint objective loss function, which consists of a single-modal loss and a multi-modal loss. The single-modal loss is measured by Euclidean distance, and the multi-modal loss is measured by a Mahalanobis distance obtained through metric learning; in the new feature space induced by the learned metric matrix, the distance between different modes can be described more accurately. In order to ensure gradient stability in the long-sequence learning process, a multi-layer gated recurrent unit (GRU) auto-encoder model was adopted. The DCRA integrates data feature extraction and feature fusion. Comparative experiments show that the DCRA outperforms the single-modal method and the latest multi-modal fusion methods, achieving a lower root mean square error (RMSE) and a higher Pearson correlation coefficient (PCC).


Introduction
The fatality rate of traffic accidents is very high; according to statistics, millions of people die from traffic accidents every year. Fatigued driving and inattention are the main causes of traffic accidents. Modern sensor technology has been widely applied to driver condition monitoring, which has reduced traffic accidents to a certain extent and saved thousands of lives [1].
A portable wearable device can collect electroencephalography (EEG) and electrooculography (EOG) signals, which are used to evaluate the driver's state in real time [2][3][4]. EEG signals directly reflect the activity of the human brain and capture the changes in brain waves caused by fatigue or drowsiness [5]. EEG is a promising neurophysiological indicator that has been used to distinguish wakefulness from sleep in various studies. EOG measures the electrical potential between the front and back of the human eye, which contains information about vigilance and eye movement, the latter of which is an effective indicator of human psychological activity. EEG and EOG signals come from different sensors, and such data are called multi-modal data. Multi-modal fusion methods include the multi-kernel method, the graph model, and the neural network method [6]. In recent years, multi-modal data fusion has attracted extensive attention [6][7][8]; for example, fusing vibration signals and acoustic signals with different attributes and characteristics provides better fault diagnosis results [9,10]. In the field of artificial intelligence, the fusion of image, sound, text, and video is a current research hotspot [11][12][13].

The main contributions of this paper are summarized as follows:
•	The joint loss function uses a Euclidean distance similarity metric within each single mode, while the multi-modal loss is measured by a Mahalanobis distance obtained through metric learning [22,23]. This distance effectively reflects the relationship between data of different modes, so that in the new feature space based on the metric matrix the distance between modes can be described more accurately; the losses of the two modes are then summed according to weights.
•	Compared to the latest fusion methods and the single-modal method, the method proposed in this paper achieves a lower root mean square error (RMSE) and a higher Pearson correlation coefficient (PCC).
The remainder of this paper is organized as follows: In Section 2, the auto-encoder and metric learning are described. The deep recurrent auto-encoder is then extended to a deep coupling recurrent auto-encoder and a combinational model. The experimental data and evaluation methods are introduced in Section 3. Section 4 describes the experimental results and compares the performance of different models. Conclusions are presented in Section 5.

Auto-Encoder
An auto-encoder is an unsupervised neural network model with a symmetric structure [24]. It consists of two parts: an encoder that converts inputs into latent representations, and a decoder that converts these internal representations into outputs. An auto-encoder usually has the same number of neurons in the input and output layers. If the number of neurons in the hidden layer is smaller than that in the input and output layers, the hidden layer must learn the most important features of the input data and discard the unimportant ones. The output layer is often called the reconstruction, and the auto-encoder attempts to reconstruct the input through the loss function so that the input and output are as similar as possible. The auto-encoder encoding function f and decoding function g are as follows:

h = f(x) = S_1(w^(1) x + b^(1)),
z = g(h) = S_2(w^(2) h + b^(2)),

where w^(1) and w^(2) are weight matrices, b^(1) and b^(2) are bias vectors, and S_1 and S_2 are nonlinear activation functions. The objective of an auto-encoder is to optimize w and b to minimize the reconstruction error. Traditionally, the mean squared error or cross entropy is used to compute the reconstruction error. The mean squared error function is l_MSE(x, z) = ||x − z||_2^2, and the cross entropy function is l_CE(x, z) = −Σ_i [x_i log z_i + (1 − x_i) log(1 − z_i)]. The auto-encoder minimizes the objective function by optimizing w and b to maximize the similarity between the reconstruction and the input.
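As a concrete illustration, the following minimal sketch builds such an auto-encoder in Keras (the framework used later in this paper). The input dimension of 64 and the 16-unit bottleneck are assumptions for illustration, not values from the paper.

import tensorflow as tf
from tensorflow.keras import layers, models

input_dim, hidden_dim = 64, 16   # assumed sizes, for illustration only

x = layers.Input(shape=(input_dim,))
# Encoder f: h = S1(w1 x + b1), with sigmoid as the nonlinearity S1.
h = layers.Dense(hidden_dim, activation="sigmoid")(x)
# Decoder g: z = S2(w2 h + b2); the reconstruction z has the input's shape.
z = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = models.Model(x, z)
# Mean squared error implements the reconstruction loss l_MSE(x, z) = ||x - z||^2.
autoencoder.compile(optimizer="adam", loss="mse")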

Metric Learning
Mahalanobis distance was proposed by the Indian statistician Mahalanobis [22] to represent the covariance distance of data. Compared with Euclidean distance, Mahalanobis distance has some excellent properties of its own: it is scale-invariant and accounts for correlations between dimensions. The traditional Mahalanobis distance, based on the inverse of the covariance matrix, is usually used to reflect the internal aggregation relationship of data. However, in many classification tasks it is not enough for the distance function to reflect only the internal aggregation relationship of the data; it is more important to establish the relationship between sample attributes and categories. In the study of metric learning, the Mahalanobis matrix is no longer simply limited to the inverse of the covariance matrix, but is obtained through the process of metric learning [25][26][27]. Given two samples x_i and x_j and a positive semi-definite matrix M, the Mahalanobis distance is expressed as follows:

d_M(x_i, x_j) = sqrt( (x_i − x_j)^T M (x_i − x_j) ).

Distance metric learning refers to using a given training sample set to learn a metric matrix that can effectively reflect the distance between data, so that in the new feature space based on the metric matrix, the distribution of similar samples is more compact, while different samples are spread further apart. The goal of metric learning is to learn M. In order to ensure a non-negative distance that satisfies the triangle inequality, M must be a positive semi-definite symmetric matrix; that is, there exists a matrix P such that M can be written as M = PP^T. Commonly used distance metric learning methods include the probabilistic global distance metric (PGDM) [26], the large margin nearest neighbor (LMNN) [28], and information-theoretic metric learning (ITML) [29]. In this study, Mahalanobis distance was used to measure the similarity between the modes.
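The distance and the decomposition M = PP^T can be checked with a short NumPy sketch; the 4-dimensional random data here is purely illustrative.

import numpy as np

def mahalanobis(x_i, x_j, M):
    # Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j).
    d = x_i - x_j
    return float(d @ M @ d)

rng = np.random.default_rng(0)
P = rng.standard_normal((4, 4))   # any real P makes M = P P^T positive semi-definite
M = P @ P.T
x_i, x_j = rng.standard_normal(4), rng.standard_normal(4)
print(np.sqrt(mahalanobis(x_i, x_j, M)))   # d_M(x_i, x_j)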

Deep Coupling Recurrent Auto-Encoder (DCRA)
The deep recurrent auto-encoder extracts deep features, which is the basis of feature fusion. The coupling layer connects two single auto-encoders together. The DCRA integrates feature representation and feature fusion.

Coupling Auto-Encoder
Multi-modal EEG and EOG information is closely related and complementary. One strategy of multi-modal fusion is to strengthen the common features of multi-modal data and weaken the individual features of each mode. The structure of the coupling auto-encoder is shown in Figure 1. The coupling auto-encoder consists of two auto-encoders with the same structure. The inputs of the two auto-encoders are EEG and EOG data, and the model reconstructions are consistent with the inputs. The coupling layer fuses the two single-modal features together through a joint objective loss function [11]. Although the two auto-encoders have the same structure, their parameters differ because of the different inputs.

In this study, a joint objective loss function was designed to train the coupling auto-encoder [10,11]. The joint loss function is shown in Formula (5). It is composed of three parts: the EEG loss L_E, the EOG loss L_O, and the multi-modal loss S. L_E and L_O are single-modal losses. Considering the internal correlation of a single mode, each single mode uses Euclidean distance to measure similarity, as shown in Formulas (6) and (7), where z_E and z_O are the reconstructions of the inputs x_E and x_O. Using Euclidean distance for the multi-modal loss S between two different modes cannot fully reflect the internal relationship between them. In order to learn the internal relations between the two modes, Mahalanobis distance, obtained through metric learning, was used to measure the distance between modes. The Mahalanobis distance of metric learning can reflect the internal relations between different modes and express the differences between them [26,27]. As shown in Formula (8), M is the Mahalanobis matrix obtained by metric learning:

L(θ_E, θ_O) = (1 − α)(L_E + L_O) + αS,   (5)
L_E = ||x_E − z_E||_2^2,   (6)
L_O = ||x_O − z_O||_2^2,   (7)
S = (f_E(x_E) − f_O(x_O))^T M (f_E(x_E) − f_O(x_O)).   (8)
In Formula (5), L_E and L_O are the auto-encoder loss functions of EEG and EOG, respectively, and θ_E and θ_O represent the parameters of the corresponding models. α is the weight of the multi-modal loss in the joint loss function. If α = 0, the joint loss function degrades to the loss functions of two independent auto-encoders, which cannot capture any correlation between inputs from different modes. If α = 1, the loss function focuses only on the correlation between the multi-modal inputs; it then only enforces the correlation constraint and completely ignores the characteristics of the data itself.
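A minimal TensorFlow sketch of this joint loss, assuming the weighted form of Formula (5) as reconstructed above; f_E and f_O stand for the encoded features of each mode, and M is the learned Mahalanobis matrix.

import tensorflow as tf

def joint_loss(x_E, z_E, x_O, z_O, f_E, f_O, M, alpha):
    # Formula (5): (1 - alpha) * (L_E + L_O) + alpha * S.
    L_E = tf.reduce_sum(tf.square(x_E - z_E))        # Formula (6), Euclidean
    L_O = tf.reduce_sum(tf.square(x_O - z_O))        # Formula (7), Euclidean
    d = f_E - f_O                                    # encoded features, shape (batch, dim)
    S = tf.reduce_sum(tf.matmul(d, M) * d)           # Formula (8), Mahalanobis
    return (1.0 - alpha) * (L_E + L_O) + alpha * S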
In Formula (8), f_E and f_O represent the auto-encoder mapping (encoding) functions of EEG and EOG, and M is the Mahalanobis matrix for metric learning. In this study, the probabilistic global distance metric learning (PGDM) method was used to learn the Mahalanobis matrix [26]. This method transforms metric learning into a constrained convex optimization problem, taking selected pairwise constraints on the training samples as the constraint conditions. The dominant idea is to minimize the distance between sample pairs of the same category while constraining the distance between samples of different categories to be greater than a certain value. The optimization model is as follows:

min_M Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||_M^2
s.t. Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||_M ≥ 1, M ⪰ 0,   (9)

where M is the Mahalanobis matrix to be learned; the objective minimizes the sum of the squared distances between any x_i and x_j of the same class, and the constraint requires that the summed distance between x_i and x_j of different classes is greater than 1 and that M is positive semi-definite. In Formula (9), S is the set of sample pairs of the same class, and D is the set of sample pairs of different classes. The loss function of PGDM is expressed as the following:

g(M) = Σ_{(x_i, x_j) ∈ S} ||x_i − x_j||_M^2 − log( Σ_{(x_i, x_j) ∈ D} ||x_i − x_j||_M ).   (10)

This loss function is equivalent to the optimization model; it is a convex optimization problem and can be directly optimized by Newton and quasi-Newton methods. Compared with Euclidean distance, Mahalanobis distance can describe the relationship between two different modes more accurately.
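The following NumPy sketch implements the PGDM objective of Formula (10) together with a projection back onto the positive semi-definite cone. It uses a crude numerical gradient purely for illustration, whereas the paper optimizes this convex objective with Newton-type methods; S_pairs and D_pairs are assumed to be lists of difference vectors x_i − x_j for same-class and different-class pairs, respectively.

import numpy as np

def pgdm_loss(M, S_pairs, D_pairs):
    # g(M) = sum_S d^T M d - log( sum_D sqrt(d^T M d) ), Formula (10).
    s = sum(d @ M @ d for d in S_pairs)
    t = sum(np.sqrt(max(d @ M @ d, 1e-12)) for d in D_pairs)
    return s - np.log(t)

def project_psd(M):
    # Clip negative eigenvalues so M stays positive semi-definite.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def learn_M(S_pairs, D_pairs, dim, lr=0.01, steps=200, eps=1e-5):
    M = np.eye(dim)
    for _ in range(steps):
        G = np.zeros_like(M)                 # numerical gradient of g at M
        for a in range(dim):
            for b in range(dim):
                E = np.zeros_like(M)
                E[a, b] = eps
                G[a, b] = (pgdm_loss(M + E, S_pairs, D_pairs)
                           - pgdm_loss(M - E, S_pairs, D_pairs)) / (2 * eps)
        M = project_psd(M - lr * G)          # gradient step + PSD projection
    return M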

Deep Coupling Recurrent Auto-Encoder (DCRA)
A deep coupling recurrent auto-encoder can extract deep features, which are the basis of feature fusion. In this paper, a deep coupling recurrent auto-encoder (DCRA) is proposed. Its structure is shown in Figure 2. DCRA encoding and decoding are each composed of three layers of gated recurrent units (GRUs), and the coupling layer connects the two independent auto-encoders. The DCRA is trained by the joint objective loss function, as expressed by Formula (5).
A recurrent neural network (RNN) is a neural network with short-term memory and parameter sharing. The nodes between the hidden layers in the structure of a recurrent neural network are connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. This structure focuses on the relevance between earlier and later data and is particularly suitable for video, voice, text, and other time-series-related problems. As shown in Figure 3, the recurrent neural network is connected not only between adjacent layers but also between hidden layers.
In Figure 3, at each time step t, neuron y(t) receives the input vector x(t) and the output vector y(t − 1) of the previous time step, and passes the result forward, step-by-step, as expressed by Formula (11):

y(t) = φ(W_x x(t) + W_y y(t − 1) + b).   (11)

In Formula (11), W_x and W_y are the weight matrices of the input x(t) and of y(t − 1), b is the bias vector, and φ is the activation function. The parameters of the recurrent neural network are learned by the back-propagation-through-time algorithm; that is, the errors are propagated step-by-step in the reverse order of time. Because the data is transformed as it traverses the RNN, some information is lost at each time step, which degrades the final result. Long short-term memory (LSTM) has been proposed to solve this problem [30][31][32]. The unique structure of LSTM can capture medium- and long-term dependencies in data. A GRU is a simplified version of an LSTM recurrent neural network [33]. The structure of a GRU is shown in Figure 4. We used GRUs in this study to design the recurrent auto-encoder. In Figure 4, z(t) controls both the forgetting gate and the input gate: the forgetting gate controls which parts of the long-term state should be deleted, and the input gate controls which parts of g(t) should be added to the long-term state. If the gate controller outputs 1, the forgetting gate opens and the input gate closes; if it outputs 0, the opposite is true. The gate controller r(t) controls how much of the previous state is shown to the main layer g(t). g(t) is the main layer, whose function is to analyze the current input x(t) and the previous state h(t − 1), store the most important part in the long-term state, and pass its output directly to h(t).
The GRU computations are given in Formulas (12)–(15), where σ is the logistic function and ⊗ denotes element-wise multiplication:

z(t) = σ(W_xz x(t) + W_hz h(t − 1) + b_z),   (12)
r(t) = σ(W_xr x(t) + W_hr h(t − 1) + b_r),   (13)
g(t) = tanh(W_xg x(t) + W_hg (r(t) ⊗ h(t − 1)) + b_g),   (14)
h(t) = z(t) ⊗ h(t − 1) + (1 − z(t)) ⊗ g(t).   (15)

In Formulas (12)–(14), W_xz, W_xr, and W_xg are the weight matrices connecting the input vector x(t) to each of the three layers; W_hz, W_hr, and W_hg are the weight matrices connecting each of the three layers to the previous short-term state h(t − 1); and b_z, b_r, and b_g are the bias terms of the three layers. GRUs can learn to identify important inputs, store them in a long-term state, and extract them when needed, which is the advantage of a GRU for processing time-series data.
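As a check on the notation, this NumPy sketch implements one GRU time step exactly as in Formulas (12)–(15); the weight shapes are left to the caller.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xg, W_hg, b_g):
    z_t = sigmoid(W_xz @ x_t + W_hz @ h_prev + b_z)          # Formula (12): update gate
    r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)          # Formula (13): reset gate
    g_t = np.tanh(W_xg @ x_t + W_hg @ (r_t * h_prev) + b_g)  # Formula (14): candidate
    return z_t * h_prev + (1.0 - z_t) * g_t                  # Formula (15): new state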
DCRA model training proceeds layer by layer, from bottom to top. The DCRA is able to learn similar features from different modes by mapping the multi-modal signals into the same representation space. The DCRA algorithm is shown below (Algorithm 1).
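Algorithm 1 itself is not reproduced here; as a rough illustration of the architecture described above, the following Keras sketch couples two 3-layer GRU auto-encoders. The sequence length and feature dimensions are assumptions for illustration, and the identity matrix merely stands in for the Mahalanobis matrix learned by PGDM.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

T, D_EEG, D_EOG, CODE = 8, 32, 8, 16   # assumed sizes, not from the paper

class CouplingLayer(layers.Layer):
    # Adds the coupling term alpha * S of Formula (8) between the two codes.
    def __init__(self, M, alpha, **kwargs):
        super().__init__(**kwargs)
        self.M = tf.constant(M, dtype=tf.float32)
        self.alpha = alpha

    def call(self, codes):
        f_E, f_O = codes
        d = f_E - f_O
        self.add_loss(self.alpha * tf.reduce_sum(tf.matmul(d, self.M) * d))
        return codes

def encoder(dim, name):
    x = layers.Input(shape=(T, dim), name=name)
    h = layers.GRU(64, return_sequences=True)(x)
    h = layers.GRU(32, return_sequences=True)(h)
    return x, layers.GRU(CODE)(h)               # 3 GRU layers; deepest code

def decoder(code, dim):
    d = layers.RepeatVector(T)(code)
    d = layers.GRU(32, return_sequences=True)(d)
    d = layers.GRU(64, return_sequences=True)(d)
    return layers.TimeDistributed(layers.Dense(dim))(d)   # reconstruction

x_E, f_E = encoder(D_EEG, "eeg")
x_O, f_O = encoder(D_EOG, "eog")
f_E, f_O = CouplingLayer(np.eye(CODE), alpha=0.4)([f_E, f_O])
model = Model([x_E, x_O], [decoder(f_E, D_EEG), decoder(f_O, D_EOG)])
model.compile(optimizer="adam", loss="mse")     # L_E and L_O reconstruction losses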

SEED-VIG
SEED-VIG [2] is a vigilance evaluation dataset collected in a simulated driving experiment by Lu Baoliang's research group. In the experiment, a Neuroscan system and eye-tracking glasses were used to record the testers' data in real time. The Neuroscan system recorded EEG and EOG data, and the eye-tracking glasses recorded eye movement data, including saccades, fixations, blinks, and eye closures. Testers were required to drive a simulated car in a virtual environment for 120 min without any warning or interference during driving. Tests were conducted after lunch, because drivers are prone to drowsiness at this time. SEED-VIG recorded data from 23 testers, 12 women and 11 men. The eye-tracking glasses accurately captured information about their eye movements. The eye movement-based PERCLOS [34] is considered to be the most reliable and effective measure of driver alertness and is a widely accepted indicator:

PERCLOS = (blink + CLOS) / (blink + fixation + saccade + CLOS),

where blink, fixation, saccade, and CLOS (eye closures) denote the corresponding eye movement states.
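A one-line helper makes the indicator concrete; the argument values in the comment are made up for illustration.

def perclos(blink, fixation, saccade, clos):
    # PERCLOS = (blink + CLOS) / (blink + fixation + saccade + CLOS).
    return (blink + clos) / (blink + fixation + saccade + clos)

# e.g., perclos(blink=2.0, fixation=40.0, saccade=6.0, clos=3.0) -> about 0.098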

Evaluation Methods
PERCLOS is suitable for regression analysis, and the root mean square error (RMSE) is a common evaluation method for regression models [35]. The RMSE evaluates the model using the squared error between the real value y and the predicted value ŷ, and the RMSE formula is as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 ).   (16)

The RMSE does not provide any structural information, whereas the Pearson correlation coefficient (PCC) provides an assessment of the linear relationship between the predicted and true values. The PCC is used as a complement to the RMSE in related models for assessing EEG and EOG. The formula of the PCC is as follows:

PCC = Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ŷ̄) / ( sqrt(Σ_{i=1}^{n} (y_i − ȳ)^2) sqrt(Σ_{i=1}^{n} (ŷ_i − ŷ̄)^2) ),   (17)

where y = (y_1, y_2, ..., y_n)^T is the actual measured PERCLOS, ŷ = (ŷ_1, ŷ_2, ..., ŷ_n)^T is the model-predicted value, and ȳ and ŷ̄ are the means of the real and predicted values, respectively. The more accurate the model, the larger the PCC value and the smaller the RMSE value.
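Both metrics are a few lines of NumPy (a sketch; y and y_hat are 1-D arrays of measured and predicted PERCLOS):

import numpy as np

def rmse(y, y_hat):
    # Formula (16): square root of the mean squared error.
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def pcc(y, y_hat):
    # Formula (17): Pearson correlation between measured and predicted values.
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return float((yc @ pc) / (np.linalg.norm(yc) * np.linalg.norm(pc)))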

Comparison Method
In this study, eight feature-level fusion methods were selected for comparison: SVR, CCNF, and CCRF [2], DAE [3], GELM [15], LSTM [16], DNNSN [17], and LSTM-CapsAtt [14]. The DNNSN is a double-layer neural network with subnetwork nodes; the model is composed of multiple subnet nodes, and each node is composed of many hidden nodes with various feature selection capabilities. The DNNSN demonstrated good results in experiments. The LSTM-CapsAtt uses a capsule attention model and a deep LSTM network to fuse EEG and EOG; the capsule attention model uses LSTM and capsule feature representation layers to learn temporal and hierarchical/spatial dependencies in the data. Experiments have shown that the LSTM-CapsAtt achieves better results than the previous seven methods.
The DCRA was implemented using the TensorFlow and Keras frameworks. In order to make the DCRA model more accurate, parameters such as the number of layers of the DCRA model, the number of neurons in each layer, the activation function, the optimization function, and the learning rate were tuned over several iterations, and the most suitable parameter combination was selected to achieve the best effect. The model parameter settings are shown in Table 1. Before the experiment, the data were first standardized so that the data of different modalities were in the same range and equally distributed. In the experiment, we found that adding batch normalization layers improved model performance, so batch normalization layers were added after the second, third, fifth, and sixth layers of the model; batch normalization has a positive effect on deep networks [36]. The learning rate and batch size were adjusted simultaneously over multiple trials: the learning rate was increased from 10^−5 to 10 by a factor of 10 at each step, and the batch size ranged from 16 to 256, doubling at each step. The model performance was optimal and convergence was fastest with a learning rate of 0.001 and a batch size of 32. ReLU and sigmoid activation functions were used, with different layers using different activation functions. Regarding the choice of optimizer, we tried AdaGrad, RMSProp, and Adam, and finally chose Adam, which converged faster than the other two optimizers.
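The grid described above is small enough to enumerate directly. In this sketch, train_and_evaluate is a hypothetical helper standing in for one DCRA training run that returns the validation RMSE:

import itertools

learning_rates = [10.0 ** e for e in range(-5, 2)]   # 1e-5 up to 10, x10 each step
batch_sizes = [2 ** e for e in range(4, 9)]          # 16 up to 256, x2 each step

results = {}
for lr, bs in itertools.product(learning_rates, batch_sizes):
    results[(lr, bs)] = train_and_evaluate(lr, bs)   # hypothetical helper
best = min(results, key=results.get)                 # (0.001, 32) in the paper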

Learning Mahalanobis Distance
In order to obtain the Mahalanobis distance M, some data was selected from the SEED-VIG dataset, and feature data was extracted with a single recurrent auto-encoder model to achieve data dimension reduction. Then, these characteristic data and labels were processed by the PGDM method for training and learning the Mahalanobis distance M. In this study, the PGDM algorithm was realized in MATLAB (Version 2016, MathWorks, Natick, MA, USA).
In reference [25], Mahalanobis distance and Euclidean distance were compared on the UCR datasets, and the results show that the error obtained with Mahalanobis distance is smaller than that with Euclidean distance. These experiments indicate that Mahalanobis distance is more accurate than Euclidean distance in measuring the similarity between different modes.

Performance Analysis
In order to verify the effectiveness of the DCRA algorithm under different distance measures, DCRA_E, based on Euclidean distance, and DCRA_M, based on Mahalanobis distance, were compared with the other eight latest fusion methods. We employed five-fold cross-validation for the data, and no overlap existed between the testing and training data. As shown in Table 2, we used different algorithms to evaluate the values of RMSE and PCC on multi-modal EEG and EOG.
As can be seen in Table 2, the RMSEs of the time-dependent CCRF and CCNF methods were 0.10 and 0.095, respectively, and the PCCs were 0.84 and 0.845, respectively. The DAE was almost the same as the CCNF, without significant change. Although the LSTM recurrent neural network achieved an adequate PCC, its RMSE significantly reduced its overall performance. We also observed that the ELM-based GELM model combined with the auto-encoder significantly improved performance because it reduced the dimension of the input data. The DNNSN performed better than the LSTM, but not as well as the GELM in terms of the RMSE. The LSTM-CapsAtt made a great leap in performance, with an RMSE of 0.029 and a PCC of 0.989, which is clearly better than the previous algorithms. Compared to the LSTM-CapsAtt, the DCRA_E was slightly inferior, but the DCRA_M was better than the LSTM-CapsAtt, which shows that similarity measurement based on Mahalanobis distance has certain advantages. In order to compare the performance of the algorithms more rigorously, the Friedman test was performed on some of the algorithms in Table 2. As shown in Table 3, the entire dataset was divided into five small datasets, namely D_1, D_2, D_3, D_4, and D_5. We selected the six algorithms with the best results from the 10 algorithms in Table 2, performed five-fold cross-validation, sorted the results, and calculated the average order value (AOV). Using SPSS software to perform a non-parametric Friedman test at α = 0.05, a chi-square value of 20.805 was calculated, while the chi-square critical value from the table is 12.592. Therefore, the hypothesis that "all algorithms have the same performance" was rejected: the performances of the six algorithms are significantly different. We then used the Nemenyi post-hoc test to further compare the algorithms. First, we used Formula (18) to calculate the critical difference (CD) of the AOV:

CD = q_α sqrt( k(k + 1) / (6N) ),   (18)

where k is the number of algorithms, N is the number of datasets, and q_α can be obtained from the table for α = 0.05. If the difference between the average order values of two algorithms exceeds the CD, the hypothesis that the two algorithms perform the same is rejected. With k = 6, N = 5, and q_α = 2.850, CD = 3.372. Figure 5 was drawn according to the order results in Table 3. In Figure 5, the vertical axis shows each algorithm, and the horizontal axis is the average order value. For each algorithm, a dot displays its AOV, and the horizontal line segment centered on the dot represents the size of the CD. If the horizontal line segments of two algorithms overlap, there is no significant difference between them; otherwise, there is a significant difference. It can easily be seen from Figure 5 that there is no significant difference between the DCRA_M and LSTM-CapsAtt algorithms, because their horizontal line segments overlap. However, the DCRA_M is closer to the vertical axis, so its performance was better than that of the LSTM-CapsAtt. Similarly, there was no significant difference between the DNNSN and the LSTM and GELM algorithms; however, the DNNSN is closer to the vertical axis, so it performed better than the LSTM and the GELM.
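Formula (18) can be verified numerically:

import math

def nemenyi_cd(k, N, q_alpha):
    # Formula (18): CD = q_alpha * sqrt( k(k + 1) / (6N) ).
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

print(round(nemenyi_cd(k=6, N=5, q_alpha=2.850), 3))   # -> 3.372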
The DCRA_M and LSTM-CapsAtt segments do not overlap with those of the DNNSN, LSTM, and GELM, indicating that the DCRA_M and LSTM-CapsAtt are significantly better than the DNNSN, LSTM, and GELM. The DCRA_E lies in the middle of all the algorithms, so its performance was moderate.
To further verify the fusion effect, the DCRA_E and DCRA_M were compared to the single-modal deep recurrent auto-encoder (DRA) model. The single-modal DRA uses only EEG or EOG as input and has no coupling layer; the other hierarchical structures are consistent with the coupling auto-encoder structure. The RMSE and PCC values of the four methods are listed in Table 4. As shown in Table 4, the RMSE and PCC values were 0.085 and 0.854, respectively, when EEG was used as the only input to the DRA. When EOG was used as the only input, the RMSE and PCC values were 0.095 and 0.805, respectively. Both the DCRA_E and DCRA_M were better than the single-modal methods, and the improvement is obvious. Because of the intrinsic relationship between EEG and EOG, the DCRA can reinforce the common features of the different modes and exploit complementary information to achieve better results than a single mode.
At the same time, it can be seen from Tables 2 and 4 that the fusion methods, including CCRF, CCNF, GELM, DNNSN, and LSTM-CapsAtt, were better than the single-modal method in the test results. It can be said that multi-modal fusion can improve the accuracy of the model and generally performs better than the single-modal method. Overall, the DCRA_M performed better than the other fusion methods.

Analysis of α
In Formula (5), α is the coupling factor of the joint loss function. In order to verify the influence of the coupling loss on the joint loss function and the RMSE, α was set to 0, 0.2, 0.4, 0.8, and 1, and the RMSE values corresponding to the different α values were obtained experimentally, as shown in Figure 6. As can be seen from Figure 6, α achieved good results at 0.2, 0.4, and 0.8. When α was equal to 0.4, the RMSE value was at its minimum. When α was equal to 0, the coupling loss was 0 and only the single-modal losses played a role, which gave the worst result. When α was equal to 1, the coupling loss weight was the largest and the single-modal losses were 0, so the effect was also poor. Theoretically, an α value that is too small will overemphasize the "individuality" of the data and ignore the correlation, while an α value that is too large will overemphasize the correlation and ignore the "individuality" of the data. Therefore, the effect is better when the α value is moderate.
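A toy computation illustrates how α reweights the joint loss of Formula (5) at its extremes; the loss values L_E = 0.6, L_O = 0.5, and S = 0.9 are made up for illustration.

# At alpha = 0 only the single-modal losses remain; at alpha = 1 only the
# coupling term S remains, matching the degenerate cases discussed above.
L_E, L_O, S = 0.6, 0.5, 0.9   # made-up loss values
for alpha in [0.0, 0.2, 0.4, 0.8, 1.0]:
    print(alpha, (1 - alpha) * (L_E + L_O) + alpha * S)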

Conclusions
Vigilance estimation based on EEG and EOG multi-modal data fusion is a hot research topic with high research value and practical prospects. In this paper, a deep coupling recurrent auto-encoder model that combines EEG and EOG is proposed. This model constructs a coupling layer that links EEG and EOG together. When constructing the coupling loss function of the model, the Mahalanobis distance is learned through metric learning to calculate the similarity between the two modal data. In order to ensure gradient stability when learning long sequences, a multi-layer GRU is used to construct the auto-encoder model. The deep coupling recurrent auto-encoder model integrates data feature extraction and feature fusion. The results of our experiments show that the proposed method is superior to the single-modal method and the latest multi-modal fusion method. Based on the comparisons of experimental results using different methods, we observed that the proposed method can handle multi-modal data fusion and project the high-dimensional data vectors from different types of sensors into a common latent space, which enables effective classification of multi-modal data. However, our method also has some problems: part of the experimental data must be set aside to learn the Mahalanobis matrix, and this part of the data must be consistent with the data required for deep model training. At the same time, the Mahalanobis matrix used in the loss function affects the speed of model convergence, and the choice of metric learning method also needs further discussion.
Deep learning has achieved promising results with EEG and EOG fusion, but it also faces some challenges. First of all, there is not yet a satisfactory solution for measuring the similarity between different modes, and this area needs more in-depth research and discussion. In addition, our next step is to find a suitable framework for multi-modal fusion.