1. Introduction
Traffic accidents cause a very large number of fatalities: according to statistics, millions of people die in traffic accidents every year, and fatigued driving and inattention are among their main causes. Modern sensor technology has been widely applied to driver-state monitoring, which has reduced traffic accidents to a certain extent and saved thousands of lives [1].
Portable wearable devices can collect electroencephalography (EEG) and electrooculography (EOG) signals, which can be used to evaluate the driver’s state in real time [2,3,4]. EEG signals directly reflect the activity of the human brain and capture the changes in brain waves caused by fatigue or drowsiness [5]. EEG is a promising neurophysiological indicator that has been used to distinguish wakefulness from sleep in various studies. EOG measures the potential difference between the front and back of the human eye and contains information about vigilance and eye movement, the latter of which is an effective indicator of human psychological activity. EEG and EOG signals come from different sensors, and such data are called multi-modal data. Multi-modal fusion methods include multi-kernel methods, graphical models, and neural network methods [6]. In recent years, multi-modal data fusion has attracted extensive attention [6,7,8]; for example, the fusion of vibration and acoustic signals, which have different attributes and characteristics, provides better fault diagnosis results [9,10]. In the field of artificial intelligence, the fusion of image, sound, text, and video is a current research hotspot [11,12,13].
EEG and EOG reflect internal cognitive states and external subconscious behaviors, respectively, and the information carried by these two modalities is closely connected and complementary. Related studies have shown that fusing EEG and EOG for vigilance analysis has clear advantages over using either modality alone [4,11,12,13,14]. However, integrating EEG and EOG presents several difficulties. On the one hand, the data themselves contain human-related disturbances: during monitoring, a subject’s spontaneous movements and shifts in thought introduce noise that is difficult to identify. On the other hand, multi-modal analysis of biological signals is inherently difficult, and identifying complementary and contradictory information in the available signals is a challenging task. In addition, the lack of an ideal method for synchronizing the modalities is another challenge in multi-modal fusion analysis.
A number of machine learning methods based on EEG and EOG have been proposed for vigilance estimation. For example, support vector regression (SVR) was applied to EEG, EOG, and combined EEG and EOG, and served as a benchmark for evaluating other models [2]. Vigilance is a dynamic process because the user’s internal psychological state evolves over time. To incorporate this time dependence into vigilance estimation, the continuous conditional neural field (CCNF) and continuous conditional random field (CCRF) were introduced in [2] to construct a vigilance estimation model. The authors of [3] proposed a multi-modal fusion strategy that uses a deep auto-encoder model to learn a better shared representation. The authors of [4] put forward an adversarial domain-adaptation network for reusing data, which saves the time of labeling large amounts of data. Huo [15] used the discriminative graph regularized extreme learning machine (GELM) to evaluate the driver’s state; an extreme learning machine is an efficient and practical feedforward neural network with a single hidden layer. The authors of [16] proposed a continuous vigilance estimation method that uses long short-term memory (LSTM) neural networks and combines EEG with forehead EOG signals; this method exploits time-dependent information and significantly improves vigilance estimation performance. The authors of [17] proposed a double-layered neural network with subnetwork nodes (DNNSN), which is composed of several subnet nodes, each consisting of many hidden nodes with various feature selection capabilities. Zhang [14] integrated a capsule attention model with a deep LSTM to fuse EEG and EOG; the capsule attention model learns the temporal and hierarchical/spatial dependencies in the data through the LSTM network and the capsule feature representation layer.
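To make the extreme learning machine mentioned above more concrete, the following is a minimal NumPy sketch of a basic (non-graph-regularized) ELM regressor; the hidden size, regularization strength, and input/output names are illustrative assumptions rather than values from [15].

```python
import numpy as np

def elm_fit(X, y, n_hidden=128, reg=1e-3, seed=0):
    """Train a single-hidden-layer ELM: random fixed input weights,
    output weights solved in closed form (ridge regression)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(size=(d, n_hidden))   # random input weights (never trained)
    b = rng.normal(size=n_hidden)        # random hidden biases
    H = np.tanh(X @ W + b)               # hidden-layer activations
    # Closed-form regularized least-squares solution for the output weights.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Map new feature vectors through the fixed hidden layer and output weights."""
    return np.tanh(X @ W + b) @ beta

# Hypothetical usage: X is an (n_samples, n_features) matrix of EEG/EOG
# features and y is a vector of continuous vigilance labels.
```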
In recent years, deep neural networks have been widely studied for the fusion of EEG and EOG, and promising results have been achieved [13]. Convolutional neural networks (CNNs), recurrent neural networks (RNNs), auto-encoders, adversarial neural networks, and attention models are widely used for feature extraction and for fusing EEG and EOG. Auto-encoders and convolutional neural networks have also shown their advantages in image reconstruction and image fusion. The authors of [18] showed that even simple auto-encoders can be trained to reconstruct an image from a damaged sample in such a way that the human eye cannot distinguish the noise from the signal. Professor Lu Baoliang of Shanghai Jiao Tong University and his team have done extensive work on the integration of EEG and EOG: they have conducted simulation experiments, collected a large amount of test data, and proposed a series of vigilance evaluation methods [2,4,19,20,21] based on these data. The experimental data used in this study came from Lu Baoliang’s team.
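As an illustration of the denoising behavior of simple auto-encoders mentioned above (not the specific architecture of [18]), a minimal PyTorch sketch of a denoising auto-encoder might look as follows; the layer sizes and noise level are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """A small fully connected denoising auto-encoder."""
    def __init__(self, in_dim=310, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, x_clean, optimizer, noise_std=0.1):
    """Corrupt the input with Gaussian noise and learn to reconstruct the clean signal."""
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    recon = model(x_noisy)
    loss = nn.functional.mse_loss(recon, x_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```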
Auto-encoders can extract deep features from data and remove noise interference. An RNN with a memory function handles time-series data well and does not require strict temporal synchronization. We therefore designed a feature extraction and fusion framework, the deep coupling recurrent auto-encoder (DCRA), which can effectively address the above problems. Our contributions include the following:
The DCRA uses multi-layer gated recurrent units (GRUs) to extract deep features and fuses them through a joint objective loss function.
The joint loss function uses a Euclidean distance similarity metric within each single modality, while the cross-modal loss is measured by a Mahalanobis distance obtained through metric learning [22,23]. The learned metric matrix defines a new feature space in which the distance between the two modalities can be described more accurately, and the single-modal and cross-modal losses are summed with weighting coefficients (a minimal sketch of the architecture and this loss follows the list).
Compared with the latest fusion methods and single-modal methods, the method proposed in this paper achieves a lower root mean square error (RMSE) and a higher Pearson correlation coefficient (PCC); both metrics are illustrated below.
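The following is a minimal PyTorch sketch of how a coupled recurrent auto-encoder with such a joint loss could be organized. It is not the authors' implementation: the latent dimension, number of GRU layers, weighting coefficients, and the way the Mahalanobis matrix M is obtained (e.g., from a separate metric-learning step) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentAE(nn.Module):
    """Multi-layer GRU encoder/decoder for one modality (EEG or EOG)."""
    def __init__(self, in_dim, latent_dim, num_layers=2):
        super().__init__()
        self.encoder = nn.GRU(in_dim, latent_dim, num_layers, batch_first=True)
        self.decoder = nn.GRU(latent_dim, in_dim, num_layers, batch_first=True)

    def forward(self, x):                 # x: (batch, time, in_dim)
        z, _ = self.encoder(x)            # latent sequence
        recon, _ = self.decoder(z)        # reconstructed sequence
        return z, recon

def mahalanobis_loss(z_eeg, z_eog, M):
    """Mean squared Mahalanobis distance between the two latent sequences.
    M is a (latent_dim, latent_dim) positive semi-definite matrix,
    assumed here to come from a prior metric-learning step."""
    diff = z_eeg - z_eog                  # (batch, time, latent_dim)
    return torch.einsum('bti,ij,btj->bt', diff, M, diff).mean()

def joint_loss(x_eeg, x_eog, ae_eeg, ae_eog, M, alpha=1.0, beta=0.5):
    """Weighted sum of per-modality reconstruction losses (Euclidean)
    and the cross-modal Mahalanobis coupling loss."""
    z1, r1 = ae_eeg(x_eeg)
    z2, r2 = ae_eog(x_eog)
    recon = nn.functional.mse_loss(r1, x_eeg) + nn.functional.mse_loss(r2, x_eog)
    couple = mahalanobis_loss(z1, z2, M)
    return alpha * recon + beta * couple
```

In this sketch the coupling term pulls the latent representations of the two modalities toward each other in the metric-induced space, while the reconstruction terms keep each latent code faithful to its own modality.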
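For completeness, the two evaluation metrics named above can be computed as in the small sketch below, where y_true and y_pred stand for ground-truth and predicted vigilance values.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between predictions and labels."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def pcc(y_true, y_pred):
    """Pearson correlation coefficient between predictions and labels."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```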
The remainder of this paper is organized as follows: In Section 2, the auto-encoder and metric learning are described, and the deep recurrent auto-encoder is then extended to a deep coupling recurrent auto-encoder and a combinational model. The experimental data and evaluation methods are introduced in Section 3. Section 4 describes the experimental results and compares the performance of different models. Conclusions are presented in Section 5.
5. Conclusions
Vigilance estimation based on EEG and EOG multi-modal data fusion is a hot research topic with high research value and practical prospects. In this paper, a deep coupling recurrent auto-encoder model that combines EEG and EOG is proposed. This model constructs a coupling layer that links EEG and EOG together. When constructing the coupling loss function of the model, a Mahalanobis distance obtained through metric learning is used to calculate the similarity between the data of the two modalities. To ensure gradient stability when learning long sequences, a multi-layer GRU is used to construct the auto-encoder model. The deep coupling recurrent auto-encoder model integrates data feature extraction and feature fusion. The results of our experiments show that the proposed method is superior to the single-modal method and the latest multi-modal fusion method. Based on the comparisons of experimental results across different methods, we observed that the proposed method can handle multi-modal data fusion and project the high-dimensional vectors of data from different types of sensors into a common latent space, which enables effective classification of multi-modal data. However, our method also has some limitations: a portion of the experimental data must be set aside to learn the Mahalanobis matrix, and this portion must be consistent with the data used to train the deep model. At the same time, the Mahalanobis matrix used in the loss function affects the speed of model convergence, and the choice of metric learning method also needs further discussion.
Deep learning has achieved promising results for EEG and EOG fusion, but it still faces some challenges. First, there is no adequate solution for measuring the similarity between different modalities, and this area needs more in-depth research and discussion. In addition, our next step is to find a more suitable framework for multi-modal fusion.