Feature Weight Driven Interactive Mutual Information Modeling for Heterogeneous Bio-Signal Fusion to Estimate Mental Workload

Many people suffer from high mental workload which may threaten human health and cause serious accidents. Mental workload estimation is especially important for particular people such as pilots, soldiers, crew and surgeons to guarantee the safety and security. Different physiological signals have been used to estimate mental workload based on the n-back task which is capable of inducing different mental workload levels. This paper explores a feature weight driven signal fusion method and proposes interactive mutual information modeling (IMIM) to increase the mental workload classification accuracy. We used EEG and ECG signals to validate the effectiveness of the proposed method for heterogeneous bio-signal fusion. The experiment of mental workload estimation consisted of signal recording, artifact removal, feature extraction, feature weight calculation, and classification. Ten subjects were invited to take part in easy, medium and hard tasks for the collection of EEG and ECG signals in different mental workload levels. Therefore, heterogeneous physiological signals of different mental workload states were available for classification. Experiments reveal that ECG can be utilized as a supplement of EEG to optimize the fusion model and improve mental workload estimation. Classification results show that the proposed bio-signal fusion method IMIM can increase the classification accuracy in both feature level and classifier level fusion. This study indicates that multi-modal signal fusion is promising to identify the mental workload levels and the fusion strategy has potential application of mental workload estimation in cognitive activities during daily life.


Introduction
Mental workload influences human performance in the specific scene or task. In recent years, heavy workload has become the ubiquitous phenomenon that may decrease the task efficiency, threaten human health and cause serious accidents. It is important to monitor and estimate mental workload levels for some particular jobs such as pilot, soldier, crew, and surgeon. Traditional measurement uses questionnaires or mental fatigue scales such as the NASA Task Load Index (NASA-TLX) to estimate mental workload. The self-rating methods can be utilized as the standard for mental workload estimation because they have the reliability and the sensitivity. These methods are practical for clinical trials and scientific experiments. However, they are subjective and cannot record the mental workload states continuously. Therefore, the focus on physiological signals for mental workload estimation is increasing.
EEG is one of the most important physiological signals for analyzing mental workload. It reflects the electrical activity of the cortex directly [1]. EEG has high temporal resolution, which is important for measuring mental states continuously. Because of its sensitivity to cognitive stimuli, EEG is well suited to experiments on mental workload estimation. Research groups worldwide have proposed EEG-based methods in the hope that mental workload estimation could be accurate and convenient.
Many researchers developed the classifiers for mental workload estimation. Baldwin et al. designed an artificial neural network for mental workload classification based on EEG spectral analysis [2]. Dong Qian proposed Bayesian-copula discriminant classifier (BCDC) based on the copula theory and kernel density estimation to detect drowsiness during daytime [3]. Some researchers focused on EEG feature extraction. For example, Roy et al. developed an efficient mental workload estimation method by the combination of power spectral density (PSD) features, and event-related potential (ERP) features [4]. Rifai Chai et al. focused on the EEG source separation and proposed an independent component analysis method by entropy rate bound minimization analysis (ERBM-ICA) to improve the driver fatigue classification [5]. Their research demonstrates that signal processing before machine learning is also important to improve the performance of mental workload estimation. In recent years, some researchers began to try deep learning methods for cross-day mental workload estimation. Zhong Yin et al. developed an adaptive Stacked Denoising AutoEncoder (SDAE) for mental workload estimation and explained that deep learning methods might be superior in comparison with static classifiers [6]. Ryan G. Hefron et al. announced that temporal dependency of EEG was promising to improve the cross-day cognitive workload estimation [7]. They achieved a significant result in within-participant cross-day condition based on deep long short-term memory structures.
Though EEG attracts most researchers in this field, research on the other physiological signals will enhance the performance of mental workload estimation. Hoover et al. proposed a real-time detecting method based on heart rate variability (HRV), which demonstrated that HRV might be a good indicator of mental workload [8]. However, the single-modal signal has the limitation to classify different mental workload levels accurately. It is a major challenge to improve the detection accuracy based on multi-modal physiological signals [9]. Whang et al. collected EEG and ECG signals to research 3D visual fatigue using heartbeat evoked potential (HEP) based on heart-brain synchronization [10]. Florence et al. combined EEG feature vector and ECG feature vector to improve the rapid detection of mental fatigue and found that the combined feature vector can enhance the capability of the classifiers [9]. Moreover, Jagannath et al. assessed the early onset of driver fatigue using EEG, ECG, blood pressure, oxygen saturation level and surface electromyography [11]. Gergelyfi et al. measured EEG, pupil size, eye blinks, skin conductance responses of the subjects in different work memory tasks [12]. Even though many types of physiological signals have been researched to estimate mental workload, few researchers focused on the signal fusion strategy. They just found the statistical results between physiological signals and different mental workload states but neglected that the heterogeneous signal fusion may be a key to improve the performance of mental workload estimation methods.
Signal fusion methods have attracted the attention of many researchers for solving pattern recognition problems. Signal fusion is also promising to extend the application of wireless sensor networks in various fields [13]. It provides the interface to utilize the large-scale information and improves the classification results. The researchers usually divide signal fusion methods into three categories which are early fusion, intermediate fusion, and late fusion [14]. Early fusion is also named as feature level fusion which emphasizes the data combination before the classification. The final feature vector consists of the features extracted from heterogeneous signals, and early fusion should put the final feature vector into the classifier alone. However, as for late fusion, different feature vectors should be fed into the classifiers respectively, and the final prediction is the combination of the different classification results. Therefore, late fusion is also called classifier level fusion or decision level fusion. Intermediate fusion represents the method between early fusion and late fusion.
Multi-modal physiological signal fusion is promising for solving biometric pattern recognition problems. For body sensor networks, multi-sensor fusion is fundamental to health monitoring, motion recognition and other applications of the Internet of Things [15]. Verma et al. concatenated the feature vectors based on heterogeneous physiological signals for emotion recognition and validated the effectiveness of feature level fusion [14]. Hogervorst et al. combined the information of EEG, skin conductance, respiration, ECG, pupil size and eye blinks for mental workload estimation [16]. They concatenated features for feature level fusion and used the average score for classifier level fusion. Christensen et al. combined the features of EEG, ECG, and EOG and applied ANN, SVM, and LDA to validate the fused feature vector. After feature concatenation, Yin et al. embedded the feature selection method into signal fusion, which improved the performance for mental workload estimation [17].
Though some researchers have begun to use signal fusion methods to estimate mental workload, the fusion algorithms still need improvement. Existing methods concatenate different feature vectors but do not consider the dependency and redundancy information of the features. Beyond information combination, information filtering is also indispensable. To solve this problem, this paper proposes interactive mutual information modeling (IMIM) for both feature level and classifier level fusion to increase the classification accuracy of different mental workload states. Mutual information is an efficient feature selection criterion that can exploit the dependency information and eliminate the redundancy information [18]. However, few researchers have considered its potential to estimate feature weights for signal fusion. This paper optimizes the mutual information algorithm and extends its application to feature level and classifier level fusion problems. Considering the complicated interaction of the features extracted from different signals, this paper proposes IMIM and validates it on the features of EEG and ECG signals. Feature level and classifier level fusion are both completed based on IMIM.
The main contribution of this work is threefold: First, this study proposes interactive mutual information modeling to estimate feature weights. Second, feature level and classifier level fusion methods are developed based on the feature weights. Third, mental workload classification accuracy is improved. Because of the ability to analyze the relationship between physiological signals and mental workload states, IMIM can be utilized to develop the body sensor networks for mental workload estimation.
The remainder of this paper is structured as follows. Section 2.1 introduces data recording in the mental workload tasks. Section 2.2 summarizes the important features for mental workload estimation and explains the extraction of feature vectors for heterogeneous bio-signal fusion. Section 2.3 displays the historical evolution of mutual information and describes the derivation of the proposed objective function of interactive mutual information modeling (IMIM). Section 2.3 also introduces the development of feature level and classifier level fusion methods based on IMIM. The experiments are presented in Section 3 to evaluate the performance of the proposed method. Feature level fusion methods, classifier level fusion methods, and other mental workload estimation methods are all utilized for comparison. Finally, Section 4 gives the conclusion.

Materials
This study invited ten subjects from Tsinghua University to take part in the experiment. They were all males and right-handed. The ages of the participants ranged from 22 to 28. All of the subjects were asked to stay away from caffeine and alcohol for at least 24 h. They were required to have enough sleep before the experiment. These restrictions were helpful to guarantee that the participants had the same baseline to start the work memory tasks.
Memory workload is an important aspect of mental workload. Memory workload can be defined as the ability to memorize and analyze short-term information [19]. Heavy memory workload impairs a person's ability to solve serious problems in real life. This paper uses memory workload as an example to explore mental workload estimation methods. To induce different memory workload levels, most researchers use several kinds of mental workload tasks based on the assumption that a harder task causes a higher mental workload. One of the most practical tasks is the n-back task, which was first designed in 1958 by Kirchner [20]. Because of its convenience and effectiveness, the n-back task has been widely used to study memory workload based on the dynamic information of letters and positions. To ensure the reliability and comparability of the proposed method, this study used traditional 1-back, 2-back, and 3-back position memory tasks to induce low, medium and high mental memory workload.
During the n-back task, as Figure 1 shows, the computer screen displayed a large square consisting of nine positions. A small blue block appeared randomly at one of the nine positions every three seconds. In the 1-back task, the subjects compared the current position of the blue block with the preceding one and pressed the A key on the keyboard as quickly as possible when the two positions were the same. Analogously, participants compared the current position with the one two steps earlier in the 2-back task and with the one three steps earlier in the 3-back task. To validate the effectiveness of the n-back task, we recorded the reaction time and the correct ratio of the subjects for statistical analysis. This study collected multi-modal physiological signals under 3 different workloads in an experiment consisting of the 1-back, 2-back, and 3-back tasks, which were intended to induce low, medium and high memory workload respectively. As Table 1 shows, before the experiment, subjects were given 5 min to calm down and prepare for the tasks. Each session contained one task with 200 trials of random positions. After each task, the subjects completed the NASA Task Load Index for self-rating. They were then given 3 min to rest and prepare for the next task. The entire experiment lasted 47 min.
Table 1. Experiment with 3 sessions according to 1-, 2-, 3-back tasks.
EEG and ECG signals were both collected during the tasks. EEG was recorded with a 16-channel headset based on the 10-20 system at a 1000 Hz sampling rate, referenced to the average of two ear electrodes. ECG was recorded with a patient monitor manufactured by Mindray, which also calculated the R-R intervals during the experiment. Each n-back task yielded 600 s of physiological signals, and the samples were extracted as 90 s segments with a 3 s step. Therefore, we obtained 171 samples from every subject in each task. The response time and the accuracy of the subjects were recorded to analyze their performance.

EEG Feature Extraction
EEG is the most important physiological signal to analyze mental workload because it reflects the cortical activities directly [21]. However, EEG is so weak that multiple kinds of noise may be induced in the recording process. The main artifacts of EEG are caused by eye blinks, muscle contraction and other devices in the measurement system [22]. ICA (independent component analysis) has been widely used to calibrate the noise sources for artifact removal [23].
This study used the EEG analysis toolbox EEGLAB [24] to preprocess the EEG signals. First, the EEG signal was filtered from 0.5 Hz to 100 Hz to remove the direct current voltage and high-frequency artifacts. Second, the ADJUST algorithm was used to remove eye blink artifacts based on ICA [25]. Third, the EEG signal was segmented into 3 s epochs based on the stimuli in the n-back tasks, giving 200 epochs per person in each task. Each sample was extracted from 30 epochs with a step of 1 epoch. Therefore, this study collected 5130 (171 × 3 × 10) samples for mental workload estimation.
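The sample counts quoted above follow from simple sliding-window arithmetic. The sketch below reproduces them; the helper name `count_windows` is introduced here for illustration only and does not come from the study.

```python
def count_windows(total, window, step):
    """Number of sliding windows of length `window` that fit into `total`."""
    if window > total:
        return 0
    return (total - window) // step + 1

# Signal-level view: 600 s of recording, 90 s windows, 3 s step.
samples_from_seconds = count_windows(600, 90, 3)
# Epoch-level view: 200 epochs of 3 s, 30 epochs per sample, 1-epoch step.
samples_from_epochs = count_windows(200, 30, 1)

tasks, subjects = 3, 10
total_samples = samples_from_epochs * tasks * subjects
print(samples_from_seconds, samples_from_epochs, total_samples)
```

Both views give the same number of samples per subject and task, which is why the 90 s / 3 s description and the 30-epoch / 1-epoch description are interchangeable.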
Many research groups developed their mental workload estimation algorithms and proposed numerous types of features. EEG power spectral density (PSD) features and event-related potential (ERP) features were the most effective features for mental workload estimation.
PSD features were extracted from the concatenation of all the epochs in each sample. In recent years, Welch's method, a periodogram-based spectrum estimator, has been used to extract EEG PSD features [26,27]. Because it is insensitive to noise, we used it to extract 2 types of PSD features, both of which proved effective in previous studies. First, we extracted the PSD features from 5 frequency bands [2,28-34]. Second, the PSD features of all frequency bands from 1 Hz to 40 Hz in 1 Hz steps were calculated [35,36]. Therefore, 736 (46 × 16) PSD features were extracted from the 16 EEG channels.
ERP features have been widely used to analyze mental workload because they can reflect the EEG activity according to visual stimuli. In this study, ERP signal was calculated using the average epoch of each sample from 0 to 1000 ms after the onset of each stimulus. This study extracted 2 types of ERP features from the ERP signal based on the previous research. First, the ERP signal was down-sampled to 100 Hz, and every point of the signal might be a useful feature [16,37]. Then 101 features were extracted from each channel. Second, this study calculated the value of the wave peak, wave valley, and the corresponding frequency of the peak and valley in each channel to obtain another 4 important ERP features [38]. Therefore, this experiment extracted 1680(105 × 16) ERP features in total.
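The point-wise ERP features can be sketched as follows, assuming that the epochs of one sample are averaged and that the first 1000 ms after stimulus onset are decimated from 1000 Hz to 100 Hz by keeping every tenth point. The function name, the synthetic sine-wave epochs, and the plain decimation without an anti-aliasing filter are illustrative assumptions, not the processing chain used in the study.

```python
import math

FS = 1000          # EEG sampling rate in Hz (from the text)
TARGET_FS = 100    # down-sampled ERP rate in Hz (from the text)

def erp_point_features(epochs, fs=FS, target_fs=TARGET_FS, window_ms=1000):
    """Average the epochs of one sample and return the down-sampled ERP points."""
    n_keep = fs * window_ms // 1000 + 1      # 0..1000 ms inclusive
    erp = [sum(ep[i] for ep in epochs) / len(epochs) for i in range(n_keep)]
    step = fs // target_fs                   # keep every 10th point
    return erp[::step]

# Hypothetical data: 30 epochs of 3 s each for a single channel.
epochs = [[math.sin(2 * math.pi * 5 * t / FS) + 0.01 * k for t in range(3 * FS)]
          for k in range(30)]
points = erp_point_features(epochs)

per_channel = len(points) + 4        # plus the 4 peak/valley features per channel
total_erp_features = per_channel * 16
print(len(points), per_channel, total_erp_features)
```

Counting both endpoints of the 0 to 1000 ms window is what yields 101 rather than 100 points per channel.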
Considering all of the PSD features and ERP features, this study used 2416 EEG features to explore the signal fusion methods for mental memory workload estimation.

ECG Feature Extraction
ECG has become one of the focuses for mental workload estimation. Heart rate (HR) and heart rate variability (HRV) have proved their efficiency to distinguish different mental workload levels. This study used the HRV analysis software (HRVAS) to extract numerous ECG features. HRVAS is a practical tool to extract time domain, frequency domain, time-frequency domain and nonlinear features for ECG analysis [39]. We used it to extract 103 ECG features based on each ECG sample.

Mutual Information
Section 2.2 has mentioned that this experiment utilized 2 EEG feature vectors and 1 ECG feature vector for mental workload estimation. Nevertheless, it is hard to estimate the importance of each feature in different feature subsets. Just combining various types of features cannot achieve satisfactory results. Research on heterogeneous signal fusion is still a challenge. Mutual information is an important index to reflect the dependency between the label vector and the feature vector [40]. Many researchers used mutual information for feature selection because it is effective to measure the performance of each feature subset. Besides its application for feature selection, it has the potential to estimate the weight of each feature for signal fusion. The preliminaries of mutual information are described as follows.
Assume that $x$ represents a feature and $y$ represents the label. The mutual information of $x$ and $y$, denoted $I(x; y)$, measures the dependency between them. The probability density functions (PDFs) of the two variables and their joint PDF are denoted $P(x)$, $P(y)$ and $P(x, y)$ respectively. The definition of mutual information is:

$$I(x; y) = \iint P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)} \, dx \, dy \tag{1}$$

The entropy of variable $x$ is denoted $H(x)$, which represents the amount of information contained in $x$. $H(x|y)$ represents the conditional entropy, i.e., the amount of information that $x$ still provides when $y$ is already known. Mutual information can also be represented as:

$$I(x; y) = H(x) - H(x|y) \tag{2}$$

where $H(x) = -\int P(x) \log P(x) \, dx$ and $H(x|y) = -\iint P(x, y) \log P(x|y) \, dx \, dy$. Nevertheless, Equation (1) can only measure the relevance of two variables. In order to measure the transmitted information between a multi-dimensional feature vector and the label, McGill extended the expression of mutual information [41]:

$$I(\mathbf{x}; y) = H(\mathbf{x}) + H(y) - H(\mathbf{x}, y) \tag{3}$$

where $\mathbf{x}$ denotes the $n$-dimensional feature vector, and $I(\mathbf{x}; y) = I(x_1, x_2, \cdots, x_n; y)$. The joint entropy $H(\mathbf{x})$ is denoted as:

$$H(\mathbf{x}) = -\int \cdots \int P(x_1, \cdots, x_n) \log P(x_1, \cdots, x_n) \, dx_1 \cdots dx_n \tag{4}$$

As Figure 2 shows, $H(\mathbf{x})$ and $H(y)$ represent the entropy of the original feature vector $\mathbf{x}$ and the label $y$ respectively. The mutual information $I(\mathbf{x}; y)$ is a measure of the information shared between $H(\mathbf{x})$ and $H(y)$. Therefore, the focus of feature selection methods based on mutual information is to find the feature subset of $\mathbf{x}$ which maximizes the mutual information $I(\mathbf{x}; y)$.
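As a small illustration of these definitions, the sketch below estimates entropy and mutual information for discrete samples using empirical frequencies. The text works with continuous PDFs, so the discrete plug-in estimator and the toy data here are simplifying assumptions made only to show that the entropy-based and conditional-entropy-based expressions agree.

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Plug-in entropy in bits of a list of discrete observations."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, ys):
    """H(x | y) = H(x, y) - H(y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

def mutual_information(xs, ys):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Toy feature with four levels; the label only reveals which half x lies in.
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [0, 0, 0, 0, 1, 1, 1, 1]

mi = mutual_information(x, y)
mi_via_conditional = entropy(x) - conditional_entropy(x, y)
print(mi, mi_via_conditional)
```

For continuous bio-signal features a density estimator or a binning step would be needed before such counts can be formed; that choice is outside this sketch.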
Because a high-dimensional probability density function is difficult to estimate, Equation (3) needs to be simplified for further calculation. Therefore, many research groups proposed general equations relating multi-variable entropy to mutual information [42,43]. Co-information, which was proposed by Bell, is a practical way to simplify the calculation through mathematical transformation [43]. Provided that $\mathbf{x}$ represents the set of variables $\{x_1, x_2, \cdots, x_n\}$, $\mathbf{x}^{(k)}$ represents one of the subsets of $\mathbf{x}$, and the dimension of $\mathbf{x}^{(k)}$ is denoted $|\mathbf{x}^{(k)}|$, co-information can be defined as follows:

$$I(x_1; x_2; \cdots; x_n) = -\sum_{\mathbf{x}^{(k)} \subseteq \mathbf{x}} (-1)^{|\mathbf{x}^{(k)}|} H(\mathbf{x}^{(k)}) \tag{5}$$

Bell derived the general formulation of co-information according to Equation (5):

$$I(x_1; x_2; \cdots; x_n) = \sum_{i} H(x_i) - \sum_{i<j} H(x_i, x_j) + \cdots - (-1)^{n} H(x_1, x_2, \cdots, x_n) \tag{6}$$

In order to transform the formulation of the mutual information given by Equation (3), Equation (5) defines the relationship between co-information and multi-variable entropy. The multi-variable entropy can be represented by a symmetrical formulation based on Equation (5):

$$H(\mathbf{x}) = -\sum_{\mathbf{x}^{(k)} \subseteq \mathbf{x}} (-1)^{n_k} I(\mathbf{x}^{(k)}) \tag{7}$$

where $n_k$ is the dimension of the subset $\mathbf{x}^{(k)}$ and $I(\mathbf{x}^{(k)})$ is the co-information of the variables in $\mathbf{x}^{(k)}$. Therefore, co-information can be used as a substitute to express mutual information based on Equation (7). Set theory has been used to interpret the definition of co-information, i.e., $I(x_1; x_2; \cdots; x_n) = f(X_1 \cap X_2 \cap \cdots \cap X_n)$ [44]. Equation (7) thus resembles the inclusion-exclusion principle.

Figure 2. Relationship between the entropies H(x) and H(y) and the mutual information I(x; y).
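If Equation (7) is substituted into Equation (3), the mutual information between the multi-dimensional feature vector and the label is represented by co-information:

$$I(\mathbf{x}; y) = -\sum_{\emptyset \neq \mathbf{x}^{(k)} \subseteq \mathbf{x}} (-1)^{n_k} I(\mathbf{x}^{(k)}; y) = \sum_{i} I(x_i; y) - \sum_{i<j} I(x_i; x_j; y) + \cdots - (-1)^{n} I(x_1; \cdots; x_n; y) \tag{8}$$

where $I(\mathbf{x}^{(k)}; y)$ denotes the co-information of the variables in $\mathbf{x}^{(k)}$ together with $y$.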
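The expansion of mutual information into co-information terms can be checked numerically on a toy case. The sketch below assumes discrete variables and the alternating-sum definition of co-information over subset entropies; it verifies, for two features, that I(x1, x2; y) equals I(x1; y) + I(x2; y) - I(x1; x2; y). The XOR data and all function names are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def entropy(columns):
    """Joint entropy in bits of one or more equally long sample columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())

def co_information(columns):
    """Alternating sum over non-empty subsets T: -sum_T (-1)^|T| H(T)."""
    total = 0.0
    for k in range(1, len(columns) + 1):
        for subset in combinations(columns, k):
            total -= (-1) ** k * entropy(subset)
    return total

def mi_joint(features, label):
    """I(x; y) = H(x) + H(y) - H(x, y) for a feature vector x."""
    return entropy(features) + entropy([label]) - entropy(list(features) + [label])

# Toy data: x1 and x2 are independent fair bits and y = x1 XOR x2.
x1, x2 = zip(*product([0, 1], repeat=2))
y = tuple(a ^ b for a, b in zip(x1, x2))

lhs = mi_joint([x1, x2], y)
rhs = (co_information([x1, y]) + co_information([x2, y])
       - co_information([x1, x2, y]))
print(lhs, rhs)
```

In this example neither feature is informative alone, yet the pair determines the label, so the three-way term is negative. This shows that the interaction term can carry synergy as well as redundancy.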

Feature Weight Estimation
The focus of this paper is heterogeneous bio-signal fusion for mental workload estimation, for which the calculation of feature weights is indispensable. Because mutual information is effective for feature selection, it offers a promising way to calculate the dependency and redundancy information, which can be extended to feature weight estimation.
Equation (8) gives the expanded form of mutual information, which can be simplified by truncation: Equation (9) keeps only the first-order terms $I(x_i; y)$ and the second-order terms $I(x_i; x_j; y)$. Each component of Equation (9) has a particular significance. $I(x_i; y)$ denotes the dependency information, which expresses the relevance between the feature and the label, whereas $I(x_i; x_j; y)$ denotes the redundancy information, which should be limited. $\lambda$ is a constant that adjusts the ratio of dependency and redundancy information.
Because of its lower computational complexity, Equation (9) has been widely used to estimate the mutual information for feature selection. The goal of these feature selection methods is to find the optimal feature subset of $\mathbf{x}$. If $\mathbf{x}^*$ represents one of the subsets of $\mathbf{x}$, the objective function, Equation (10), maximizes the truncated mutual information of Equation (9) over the candidate subsets $\mathbf{x}^*$. Though Equation (10) represents the feature selection methods clearly, feature weight parameters can be added to make the formulation more explicit. If $\mathbf{w}$ denotes the feature weight vector of $\mathbf{x}$, $w_i \in \{0, 1\}$ can represent whether $x_i$ is included in the subset $\mathbf{x}^*$, and Equation (10) is transformed into Equation (11). In feature selection methods, $w_i$ is forced to be a boolean variable. However, if $w_i$ is relaxed to a real number from 0 to 1, the feature weights can be optimized based on Equation (11). This relaxation gives the original objective function for feature weight estimation proposed in this paper, Equation (12). Inspired by [44], this study uses the mutual information matrix $Q$ to simplify the objective function:

$$Q = \begin{bmatrix} I(x_1; y) & -I(x_1; x_2; y) & \cdots & -I(x_1; x_n; y) \\ -I(x_1; x_2; y) & I(x_2; y) & \cdots & -I(x_2; x_n; y) \\ \vdots & \vdots & \ddots & \vdots \\ -I(x_1; x_n; y) & -I(x_2; x_n; y) & \cdots & I(x_n; y) \end{bmatrix} \tag{13}$$

The mutual information matrix $Q$ can be divided into the dependency matrix $D$ and the redundancy matrix $R$.
Therefore, the objective function for feature weight estimation can be derived as Equation (15). Although the objective function has been formulated, Equation (15) is hard to solve because it is a non-convex problem, and its solution may over-fit. As Equation (16) shows, one feasible option is to add an $\ell_2$ norm to the objective function.
The significance of adding the $\ell_2$ norm term $\gamma \mathbf{w}^T \mathbf{w}$ is threefold. First, it is a practical way to limit over-fitting. Second, the $\ell_2$ norm can be used as a sparse item which adjusts the scale of $\mathbf{w}$ and prevents too many features from having large weights. Third, a suitable $\gamma$ makes $R + \gamma I$ positive semi-definite, so that Equation (16) becomes a convex problem, which is easier to solve than the original non-convex one. Therefore, $\gamma \geq |\min\{\mathrm{eig}(R)\}|$ is an important precondition for solving Equation (16), where $\mathrm{eig}(R)$ represents the eigenvalues of $R$. The final objective function is defined as Equation (17), which is named interactive mutual information modeling (IMIM).
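The reason the stated bound on γ yields a convex problem can be written out in a few lines. The derivation below is a sketch that assumes R is a real symmetric matrix, which holds when its entries are the symmetric interaction terms I(x_i; x_j; y); the symbols μ_i and v_i for the eigenvalues and eigenvectors are introduced here and do not appear in the paper.

```latex
% Assumption: R is real and symmetric, so R = \sum_i \mu_i v_i v_i^T with
% real eigenvalues \mu_i and orthonormal eigenvectors v_i.
\begin{align*}
R + \gamma I &= \sum_i (\mu_i + \gamma)\, v_i v_i^T
  && \text{(same eigenvectors, eigenvalues shifted by } \gamma\text{)} \\
\gamma \ge \left|\min_i \mu_i\right|
  &\;\Rightarrow\; \mu_i + \gamma \ge \min_j \mu_j + \left|\min_j \mu_j\right| \ge 0
  && \text{for every } i \\
  &\;\Rightarrow\; w^T (R + \gamma I)\, w = \sum_i (\mu_i + \gamma)\,(v_i^T w)^2 \ge 0
  && \text{for every } w
\end{align*}
% A quadratic form with a positive semi-definite matrix is convex in w,
% which is the property a convex quadratic programming solver requires.
```

Choosing γ larger than the bound keeps the problem convex but shrinks the weights more strongly, so the smallest admissible value is a natural starting point.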
In Equation (17), the objective function consists of 2 parts: $\mathbf{w}^T D$ represents the dependency information and $\mathbf{w}^T (R + \gamma I) \mathbf{w}$ represents the redundancy information. $\lambda$ is a constant that adjusts the ratio of the two parts. A large $\lambda$ emphasizes increasing the dependency information; conversely, a small $\lambda$ emphasizes eliminating the redundancy information. In practice, $\lambda$ needs to be chosen according to the conditions, and Section 3.3 explains this procedure.
Equation (17) is a typical quadratic programming problem, which can be solved using a convex optimization toolbox. Since $R + \gamma I$ is a positive semi-definite matrix, the problem can be solved by several methods in polynomial time. For example, the primal interior point method solves it in $O(n^3 L)$, where $L$ represents the input size [45]. To solve this convex quadratic programming problem, we use the YALMIP interface with the MOSEK solver, which is free for academic use [46,47]. MOSEK is one of the most efficient solvers for linear, quadratic and conic problems. Though MOSEK may not have the best performance for Equation (17), it is designed to exploit sparsity to reduce storage usage and computational time [47]. MOSEK can solve problems with several thousand variables in nearly 2 min on an Intel(R) Core(TM) i5-3470 CPU. This study used MOSEK only to validate the proposed method.
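For readers without YALMIP or MOSEK, the structure of such a problem can be illustrated with a few lines of plain Python. The sketch below is not the solver used in the study: it assumes one plausible form of the objective, minimizing w^T (R + γI) w − λ d^T w with every weight bounded in [0, 1], where d is a hypothetical vector of dependency terms I(x_i; y). It uses projected gradient descent rather than an interior point method, and the toy numbers are invented for illustration.

```python
def solve_box_qp(R, d, lam, gamma, iters=2000):
    """Projected gradient descent for the ASSUMED objective
         minimize  w^T (R + gamma*I) w - lam * d^T w,   0 <= w_i <= 1.
    R must be symmetric and gamma large enough that R + gamma*I is PSD.
    """
    n = len(d)
    A = [[R[i][j] + (gamma if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    # Largest absolute row sum bounds the top eigenvalue of A (Gershgorin),
    # so 1 / (2 * bound) is a safe step size for the gradient 2*A*w - lam*d.
    step = 1.0 / (2.0 * max(sum(abs(v) for v in row) for row in A))
    w = [0.5] * n
    for _ in range(iters):
        grad = [2.0 * sum(A[i][j] * w[j] for j in range(n)) - lam * d[i]
                for i in range(n)]
        # Gradient step followed by projection onto the box [0, 1]^n.
        w = [min(1.0, max(0.0, w[i] - step * grad[i])) for i in range(n)]
    return w

# Hypothetical toy problem: features 0 and 1 are mutually redundant,
# feature 2 is weakly relevant and independent of the others.
d = [0.6, 0.5, 0.1]
R = [[0.0, 0.4, 0.0],
     [0.4, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
gamma = 0.4        # equals |min eig(R)| for this R, so R + gamma*I is PSD
weights = solve_box_qp(R, d, lam=1.0, gamma=gamma)
print([round(v, 4) for v in weights])
```

On this toy input the weaker of the two redundant features is driven to a weight of zero, which illustrates how the redundancy term produces sparse weights.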

Heterogeneous Bio-Signal Fusion
This study extracted three kinds of feature vectors for heterogeneous bio-signal fusion: the EEG power spectral density (PSD) feature vector, the EEG event-related potential (ERP) feature vector and the ECG feature vector. Normalization of the feature vectors is a precondition for further analysis, but it is not the focus of this paper. We simply chose the unity-based normalization method according to Equation (18):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{18}$$

where $\min(x)$ and $\max(x)$ are the minimum and maximum values of the feature.
All of the features are thereby brought into [0, 1], which makes it convenient to adjust the importance of the features based on the feature weight vector. This paper subsequently develops feature level and classifier level fusion methods based on IMIM for mental memory workload estimation.
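Unity-based (min-max) normalization of a single feature can be sketched as below. The function name and the handling of a constant feature are assumptions made for this illustration.

```python
def unity_normalize(column):
    """Rescale one feature (a list of values across samples) into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

print(unity_normalize([2.0, 4.0, 6.0]))
```

When the classifier is evaluated on held-out subjects, the minimum and maximum would normally be taken from the training data only; the text does not state which convention was used.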

Feature Level Fusion
Feature level fusion methods combine the three feature vectors into a new feature vector and feed it into one classifier. After the normalization of the feature vector, the feature weights are calculated by Equation (17) to adjust the scale of each feature. Given the feature vector $\mathbf{x} = \{x_1, x_2, \cdots, x_n\}$, the weighted feature vector can be represented by $\mathbf{x}' = \{w_1 x_1, w_2 x_2, \cdots, w_n x_n\}$. Because of the sparsity of $\mathbf{w}$, many elements of $\mathbf{x}'$ are 0. After removing these invalid features, we obtain the fused feature vector $\mathbf{x}''$. k-NN and SVM are used to validate the performance of the fused feature vector and develop the feature level fusion methods. This kind of method is named interactive mutual information modeling for feature level fusion (IMIM-F). It is noteworthy that IMIM-F is not suitable for classifiers that are insensitive to feature weights, such as decision trees.
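The weighting and pruning step can be illustrated with a short sketch. The function name, the tolerance used to decide that a weight is zero, and the example values are hypothetical.

```python
def fuse_features(sample, weights, tol=1e-8):
    """Scale each normalized feature by its weight and drop zero-weight ones.

    Returns the fused vector and the indices of the features that were kept.
    """
    kept = [i for i, w in enumerate(weights) if w > tol]
    return [weights[i] * sample[i] for i in kept], kept

sample = [0.5, 0.25, 1.0]        # one normalized sample with three features
weights = [0.75, 0.0, 0.125]     # hypothetical weights from the optimization
fused, kept = fuse_features(sample, weights)
print(fused, kept)
```

Distance-based classifiers such as k-NN respond directly to this rescaling, which is consistent with the remark that classifiers insensitive to feature scale gain nothing from it.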

Decision Level Fusion
In decision level fusion methods, different feature vectors are put into the classifiers respectively. And the classification scores are combined to calculate the final results. In this study, k-NN and SVM are used to validate the performance of classifier level fusion methods for mental workload estimation.
In order to obtain the coefficients of the weighted average of classification scores, three feature weight vectors $(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \mathbf{w}^{(3)})$ are calculated based on Equation (17) using the EEG PSD feature vector, the EEG ERP feature vector and the ECG feature vector. The weight $\beta_i$ of each classifier is defined as the optimized value of the objective function in Equation (17). This method is named interactive mutual information modeling for classifier level fusion (IMIM-C).
As Equation (19) shows, the calculation of $\mathbf{w}^{(i)}$ is the precondition for obtaining $\beta_i$. $\mathbf{w}^{(i)}$ adjusts the dependency and redundancy information in each feature vector, which helps to strengthen the performance of each classifier. Therefore, unlike traditional classifier level fusion methods, IMIM-C changes the scale of each feature using $\mathbf{w}^{(i)}$ before the classification. Each feature vector is transformed by multiplying every feature by its weight in $\mathbf{w}^{(i)}$. After that, the 3 transformed feature vectors are fed into the classifiers for classifier level fusion. The final prediction is the weighted average of the classification scores based on $\beta_i$.
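The final combination step can be sketched as a weighted average of per-classifier class scores. In this illustration the β values are supplied directly as hypothetical numbers, and dividing by their sum is an assumption made so that the fused scores stay on the same scale as the inputs.

```python
def fuse_scores(score_lists, betas):
    """Weighted average of per-classifier class scores.

    score_lists[i][c] is the score of classifier i for class c.
    Returns the fused scores and the index of the winning class.
    """
    total = sum(betas)
    n_classes = len(score_lists[0])
    fused = [sum(b * s[c] for b, s in zip(betas, score_lists)) / total
             for c in range(n_classes)]
    return fused, max(range(n_classes), key=fused.__getitem__)

# Hypothetical scores for (low, medium, high) workload from three classifiers
# trained on the EEG PSD, EEG ERP and ECG feature vectors respectively.
scores = [[0.5, 0.25, 0.25],
          [0.25, 0.5, 0.25],
          [0.25, 0.25, 0.5]]
betas = [2.0, 1.0, 1.0]
fused, label = fuse_scores(scores, betas)
print(fused, label)
```

With these numbers the classifier carrying the largest β decides the outcome even though the other two disagree with it.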

Results
This paper collects EEG and ECG signals to estimate mental workload based on heterogeneous signal fusion. Feature level fusion methods combine various types of features into a whole feature vector and adjust the weight of each feature for information fusion. Decision level fusion methods suppose that every feature vector is independent. They estimate the weight of each classifier trained on each feature vector, and the final prediction is the weighted average of the classification scores. The experiments validate IMIM-F and IMIM-C based on the features discussed in Section 2.2.
Cross-validation is necessary to separate training and test datasets when the amount of data is limited. It is also indispensable for limiting over-fitting during the training of the classifiers. Over-fitting occurs when the machine learning algorithm is so complex that it fits the internal details of the training dataset too closely, so that the classifier fails to generalize to new data. In this paper, a method named "leave-one-proband-out" is used for cross-validation [48]. It uses the samples of one subject as the test dataset and the samples of the other nine subjects as the training dataset. The test dataset is selected from the 10 subjects in turn, and the classification process is repeated 10 times. This method has two advantages. First, "leave-one-proband-out" operates similarly to 10-fold cross-validation and does not use any test data in the training process. It can evaluate the generalization ability of the classifiers and limit over-fitting issues based on the prediction of the test dataset, which can be regarded as unseen data. Second, it ensures the effectiveness of the experiment for subject-independent application [48]. The average accuracy and standard deviation are obtained from the 10 classification results.
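The splitting scheme described above can be sketched in a few lines. The generator name is introduced here for illustration, and the sample layout (10 subjects, 3 tasks, 171 samples per task) follows the counts given in the data description.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) once per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# 10 subjects x 3 tasks x 171 samples per task.
subject_ids = [s for s in range(10) for _ in range(3 * 171)]
folds = list(leave_one_subject_out(subject_ids))

for held_out, train, test in folds[:2]:
    print(held_out, len(train), len(test))
```

Because each fold holds out every sample of one person, overlapping windows from the same recording never appear on both sides of the split.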

Analysis of the N-Back Task
Physiological signals were recorded at three mental memory workload levels induced by the 1-, 2- and 3-back tasks. It is necessary to verify that these tasks differed in difficulty and that the subjects experienced different workload. Table 2 presents the self-rated NASA Task Load Index (NASA TLX) scores of all subjects, and Table 3 shows the performance of the subjects during these tasks. The NASA TLX is a subjective questionnaire used here to validate the effectiveness of the n-back task. The NASA TLX scores increase with n, with average scores of 29.3 ± 6.4, 49.5 ± 4.5 and 69.6 ± 6.9 for the 1-, 2- and 3-back tasks respectively (repeated measures ANOVA: F(2,27) = 111.52, p < 0.01). Tukey post hoc tests show that the subjects experienced different mental workload in each of these tasks.
Table 3. Average performance including the response accuracy and the reaction time of the ten subjects during the 1-, 2-, 3-back tasks.
Besides the subjective measurement, Table 3 provides an objective validation of the mental workload tasks, based on the inference that high mental workload induces more errors and slower responses. From the 1-back to the 3-back task, the response accuracy declines (repeated measures ANOVA: F(2,27) = 23.9, p < 0.01) and the reaction time is prolonged (repeated measures ANOVA: F(2,27) = 81.6, p < 0.01). Tukey post hoc tests validate the differences among the three tasks. In the 1-back task, the average accuracy is 93.3% with a standard deviation of 4.3%, which implies that all of the subjects accomplished this task successfully and experienced only low memory workload. The average accuracy decreases to 90.6% in the 2-back task, which indicates that the subjects needed to pay more attention and experienced increased mental workload. The accuracy in the 3-back task is only 57.3%, much lower than in the previous two tasks. The subjects had the highest mental workload in the 3-back task because it was the most difficult.
The time interval from the visual stimulus to the key press was collected to analyze the reaction time of the subjects. The reaction time increases from the 1-back task to the 3-back task, which supports the conclusion that the 1-, 2- and 3-back tasks induce low, medium and high mental memory workload respectively. Both the subjective and the objective results confirm the differences among the 1-, 2- and 3-back tasks, which validates the induction of three mental workload levels.
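The validation above relies on a repeated measures ANOVA across the three tasks. As a rough illustration of how such an F statistic is formed, the following minimal sketch partitions the variance into condition, subject and error terms. The function name `repeated_measures_anova` and the toy numbers are assumptions for illustration only; they are not the study's data or analysis code, and the Tukey post hoc tests are not covered.

```python
def repeated_measures_anova(scores):
    """One-way repeated measures ANOVA.

    scores[s][c] is the measurement of subject s under condition c
    (for example, the reaction time of one subject in the c-back task).
    Returns (F, df_condition, df_error).
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of conditions
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[c] for row in scores) / n for c in range(k)]
    subj_means = [sum(row) / k for row in scores]

    # Partition the total sum of squares.
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand_mean) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


# Hypothetical toy scores: 3 subjects (rows) x 3 conditions (columns).
toy_scores = [[1, 2, 3], [2, 3, 5], [3, 5, 6]]
f_value, df_cond, df_error = repeated_measures_anova(toy_scores)
print(f"F({df_cond},{df_error}) = {f_value:.2f}")
```

In this formulation, n subjects and k conditions give (k - 1, (k - 1)(n - 1)) degrees of freedom, because the between-subject variance is removed from the error term.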

Data Recording
EEG and ECG signals were collected from ten subjects during the 1-, 2- and 3-back tasks to evaluate the effectiveness of IMIM for mental memory workload estimation. Figure 3 shows an example of the EEG signal, which consists of 3 epochs collected from one subject during the 1-back task. This experiment used the ADJUST algorithm, which is based on ICA, to remove artifacts from the EEG signal [25]. ECG features were extracted from the R-R intervals provided by a patient monitor manufactured by Mindray. Figure 4 shows an example of the R-R intervals derived from the ECG signal collected from one subject during the 1-back task.
As Section 2.2 shows, EEG power spectral density (PSD) features, EEG event-related potential (ERP) features and ECG features are extracted to validate the signal fusion methods for mental memory workload estimation. Signal fusion methods usually fall into three categories: feature level fusion methods, classifier level fusion methods, and methods between feature level and classifier level fusion. The first two categories are the most popular, and this study validates IMIM under both feature level and classifier level fusion.
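To make the step from R-R intervals to ECG features concrete, the sketch below derives the mean heart rate and two widely used time-domain heart rate variability measures (SDNN and RMSSD). Which HRV definitions were actually used in the experiment is not stated in this section, so the choice of measures, the function name `hrv_features` and the sample intervals are assumptions made for illustration.

```python
import math


def hrv_features(rr_ms):
    """Compute simple ECG features from a list of R-R intervals in milliseconds."""
    n = len(rr_ms)
    if n < 2:
        raise ValueError("at least two R-R intervals are required")

    mean_rr = sum(rr_ms) / n
    heart_rate = 60000.0 / mean_rr  # beats per minute

    # SDNN: sample standard deviation of the R-R intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (n - 1))

    # RMSSD: root mean square of successive R-R differences.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    return {"mean_rr": mean_rr, "heart_rate": heart_rate,
            "sdnn": sdnn, "rmssd": rmssd}


# Hypothetical R-R intervals (ms); not taken from the recorded data.
features = hrv_features([800, 810, 790, 800])
print(features)
```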

Feature Level Fusion
Feature level fusion integrates all feature vectors and takes into account the relevance between any two features. Compared with classifier level fusion, it usually uses the information in the features more effectively and achieves better performance.

Parameter Adjustment for Feature Weight Estimation
As Section 2.3 shows, the parameter λ in Equation (17) adjusts the scale of each feature. A small λ puts more emphasis on redundancy information and tends to reduce the feature weights in order to eliminate it. IMIM then estimates more feature weights as 0, and the fused feature vector becomes sparse. This section explains how λ is determined for the single and the fused feature vectors in order to find the most effective feature weights for each feature vector.
This paper presents the selection process of λ based on k-NN (k = 3) and soft margin SVM ($C = 10^{-3}$) classifiers. Figures 5 and 6 show the classification performance and the number of features whose weights are greater than 0 as functions of λ. Several definitions are needed to describe Figures 5 and 6 clearly. ECG features consist of heart rate, R-R interval and different definitions of heart rate variability according to HRVAS [39]. PSD features and ERP features are extracted from EEG according to the power spectral density and the moments of the stimuli respectively. The EEG feature vector is the combination of the PSD features and the ERP features. The ALL feature vector is the combination of the EEG features and the ECG features. Among the feature vectors involved in this experiment, the EEG feature vector and the ALL feature vector are fused feature vectors, whereas the PSD, ERP and ECG feature vectors are single feature vectors.
The appropriate λ clearly differs across classifiers and feature vectors. Each feature vector carries its own amount of dependency and redundancy information, and different classifiers differ in their ability to use this information. The parameter λ is the tool that adjusts the ratio between the two kinds of information. Therefore, λ cannot be set to a fixed constant and should be re-selected for each model. Figures 5 and 6 also show how the classification accuracy changes as the number of useful features (those with weights greater than 0) increases. All curves follow the same trend: the accuracy first increases and then drops, because redundancy information increases as the feature weight vector w grows.
Figure 6. The classification accuracy using the single feature vectors (PSD feature vector, ERP feature vector, ECG feature vector) and fused vectors (EEG feature vector, ALL feature vector) according to λ based on SVM ($C = 10^{-3}$).
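The λ selection described above amounts to a sweep: for each candidate λ, estimate the feature weights, count the features with nonzero weight, evaluate the classifier, and keep the best λ for that classifier and feature vector. The sketch below shows only this outer loop. The callables `estimate_weights` and `evaluate` are hypothetical placeholders for the IMIM weight estimation of Equation (17) and for the cross-validated classifier, neither of which is reproduced here, and the lookup tables are invented toy values.

```python
def count_active_features(weights):
    """Number of features whose weight is greater than 0."""
    return sum(1 for w in weights if w > 0)


def select_lambda(candidates, estimate_weights, evaluate):
    """Return (best_lambda, best_accuracy, active_feature_count).

    estimate_weights(lam) -> list of feature weights (placeholder for IMIM).
    evaluate(weights)     -> classification accuracy (placeholder for the
                             cross-validated k-NN or SVM).
    """
    best = None
    for lam in candidates:
        weights = estimate_weights(lam)
        accuracy = evaluate(weights)
        if best is None or accuracy > best[1]:
            best = (lam, accuracy, count_active_features(weights))
    return best


# Invented toy tables standing in for the real estimator and classifier:
# a smaller lambda gives a sparser weight vector, and the accuracy rises
# and then falls as more features become active.
toy_weights = {0.1: [0.0, 0.0, 0.5], 1.0: [0.2, 0.0, 0.7], 10.0: [0.4, 0.3, 0.9]}
toy_accuracy = {1: 0.70, 2: 0.85, 3: 0.80}

best = select_lambda(
    candidates=[0.1, 1.0, 10.0],
    estimate_weights=lambda lam: toy_weights[lam],
    evaluate=lambda w: toy_accuracy[count_active_features(w)],
)
print(best)
```

Because the best λ depends on both the classifier and the feature vector, such a sweep would be repeated for every classifier and feature vector combination.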

Necessity of Signal Fusion and Parameter Selection of the Classifiers
It is reasonable to expect that a fused feature vector, which provides more information than a single feature vector, will reach better classification accuracy. Nevertheless, two issues may undermine this inference. First, if the feature level fusion method is not appropriate and many redundant features are mistakenly considered important, the fused feature vector may perform even worse than the single feature vectors. Second, if the classification accuracy based on a single feature vector is already high enough, there is no need to develop a signal fusion framework. Therefore, Figure 7 compares the classification results of the single and fused feature vectors based on IMIM-F with different classifiers, to validate the usefulness of this method and to explain the necessity of signal fusion.
The EEG PSD feature vector and the EEG ERP feature vector perform better than the ECG feature vector, which emphasizes the importance of EEG features for estimating mental workload. The EEG feature vector is the combination of the EEG PSD feature vector and the EEG ERP feature vector. Its classification accuracy is clearly better than that of the PSD feature vector and the ERP feature vector, which validates the effectiveness of IMIM-F for homogeneous feature level fusion. The ALL feature vector fuses the 3 single feature vectors based on IMIM-F, and the performance is improved significantly. Therefore, the best performance based on a single feature vector is not sufficient, and it is still necessary to fuse different feature vectors. The fused feature vectors reach higher classification accuracy, which supports the effectiveness of the proposed method. The classification results in Figure 7 thus address the two issues raised above.
Figure 7 also presents the parameter selection for k-NN (k = 1, 3) and soft margin SVM ($C = 10^{-1}, 10^{-3}$) classifiers. The k-NN classifier reaches better accuracy when k = 3. The parameter C clearly affects the performance of SVM, which reaches higher accuracy when $C = 10^{-3}$. Table 4 presents the classification results using the ALL feature vector with different parameters. The parameter k of the k-NN classifier is chosen from {1, 3, 5, 10}. The most suitable choice for k-NN is k = 3, which achieves an accuracy of 88.1%. The parameter C of the soft margin SVM is chosen from {$10^{-1}$, $10^{-2}$, $10^{-3}$, $10^{-4}$}, and SVM reaches its highest classification accuracy of 90.6% when $C = 10^{-3}$. Therefore, the optimum parameters are k = 3 and $C = 10^{-3}$ for k-NN and SVM respectively.

Comparison of Feature Level Fusion Methods
Researchers have recently proposed several feature level fusion methods that improve classification accuracy by using multi-modal signals recorded from different types of sensors. Concatenation, multi-kernel learning, and probability-based linear dependency modeling have all been used to address the feature level fusion problem. Figure 8 compares the proposed method IMIM-F (interactive mutual information modeling for feature level fusion) with the Concatenation method, VGGMKL [49], and LFDM (linear feature dependency modeling) [50] to assess the advantage of IMIM.
Figure 8 compares the feature level fusion methods in 2 aspects. First, it compares the improvement of these methods from the single feature vectors to the fused feature vectors. Second, it investigates the ability of the methods to distinguish different mental workload conditions across the 1-, 2- and 3-back tasks. Figure 8 shows the performance of the methods in 6 conditions (1-back VS 2-back, 1-back VS 3-back, 2-back VS 3-back, 1-back VS 2-, 3-back, 1-, 2-back VS 3-back, and 1-back VS 2-back VS 3-back). All of the methods perform poorly in the 2-back VS 3-back and the 1-, 2-back VS 3-back conditions, which implies that the mental workload levels in the 2-back and 3-back tasks are similar. However, these methods reach higher accuracy in the 1-back VS 2-back, 1-back VS 3-back and 1-back VS 2-, 3-back conditions, which reflects the large differences between the 1-back task and the tasks with larger n. Figure 8 (f) measures the performance of the methods in the 1-back VS 2-back VS 3-back condition, which provides the most comprehensive evaluation. In this panel, IMIM-F based on k-NN does not perform better than the other methods when using the EEG PSD feature vector. However, the IMIM-F methods outperform the others on all of the remaining feature vectors. For heterogeneous feature level fusion in the 1-back VS 2-back VS 3-back condition, the classification accuracy of the IMIM-F methods based on k-NN and SVM reaches 88.0% and 90.6% respectively, which supports the advantage of IMIM-F in feature level fusion.
This paper uses the "leave-one-proband-out" strategy for cross-validation. This strategy uses the data from one subject as the test dataset and the data from the other nine subjects as the training dataset. It not only limits over-fitting but also yields a subject-independent mental workload classifier. Figure 9 shows the classification accuracy for each of the ten subjects in the 1-back VS 2-back VS 3-back condition based on the ALL feature vector.
Figure 9. Comparison of feature level fusion methods according to ten subjects involved in this experiment. The subjects are numbered from A to J randomly.
Figure 9 shows that the traditional Concatenation method has the worst performance. Simply concatenating different feature vectors cannot improve the classification accuracy significantly. The IMIM-F methods, including IMIM-F-kNN and IMIM-F-SVM, improve the classification accuracy significantly compared with Concatenation, LFDM, and VGGMKL (paired t-test, p < 0.01). However, the difference between IMIM-F-kNN and IMIM-F-SVM is not significant (paired t-test, p = 0.4306). IMIM-F outperforms the other methods based on k-NN (k = 3) and SVM ($C = 10^{-3}$) for every subject, which indicates that over-fitting is limited and that IMIM-F is well suited to subject-independent mental workload estimation.
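The "leave-one-proband-out" strategy can be sketched as follows: each subject in turn is held out as the test set while a classifier is trained on the remaining subjects. The sketch uses a plain k-NN classifier with k = 3 and Euclidean distance. The function names and the toy samples are assumptions for illustration, and the IMIM feature weighting that would precede classification in the paper is omitted.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`.

    train is a list of (features, label) pairs.
    """
    neighbours = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_subject_out(samples, k=3):
    """Per-subject accuracy; samples is a list of (subject, features, label)."""
    accuracy = {}
    for held_out in sorted({subject for subject, _, _ in samples}):
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracy[held_out] = correct / len(test)
    return accuracy


# Hypothetical two-feature samples from three subjects, two workload labels.
toy_samples = [
    ("A", (0.00, 0.10), "low"), ("A", (1.00, 0.90), "high"),
    ("B", (0.10, 0.00), "low"), ("B", (0.90, 1.00), "high"),
    ("C", (0.05, 0.05), "low"), ("C", (0.95, 1.05), "high"),
]
print(leave_one_subject_out(toy_samples, k=3))
```

Because no data from the held-out subject enters training, the per-subject accuracies estimate how the model would behave on a person it has never seen.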

Decision Level Fusion
Decision level fusion, referred to as classifier level fusion elsewhere in this paper, combines the scores of different classifiers to improve the classification accuracy. These methods focus on estimating the weight of each classifier and calculating the weighted average of the results. We assume that the EEG power spectral density (PSD) feature vector, the EEG event-related potential (ERP) feature vector, and the ECG feature vector are independent. Decision level fusion methods usually feed the three feature vectors into three separate classifiers. For comparison, different classifier level fusion methods, including the Average method, VGGMKL, LCDM (linear classifier dependency modeling), and IMIM-C (interactive mutual information modeling for classifier level fusion), are used to estimate mental memory workload.

Decision Level Fusion Method Based on IMIM
IMIM-C differs from traditional classifier level fusion methods in one respect.
As Equation (19) shows, IMIM-C obtains the weight of each feature $w_j^{(i)}$ and the weight of each classifier $\beta_i$ simultaneously. Therefore, it can optimize the scale of each feature before the classification process to improve the performance.
As Figure 10 shows, IMIM-C consists of four steps. First, this method calculates the weight of each feature $w_p^{(q)}$ and the weight of each feature vector $\beta_q$ based on IMIM. Second, it calculates the scaled feature vectors to adjust the weights of the features in each feature vector. Third, it feeds the three feature vectors into the classifiers and obtains 3 predictions. Fourth, the final decision is the weighted average of the classification results based on the feature vector weights $\beta_q$.
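Steps two to four above can be sketched in a few lines, assuming the feature weights and the feature vector weights have already been estimated in step one. The function `fuse_decisions`, the normalization of the vector weights by their sum, and the fixed-score toy classifiers are assumptions made for illustration; the weight estimation of Equation (19) is not reproduced.

```python
def fuse_decisions(feature_vectors, feature_weights, vector_weights, classifiers):
    """Weighted-average decision fusion over several feature vectors.

    feature_vectors[q] : raw features of feature vector q
    feature_weights[q] : per-feature weights for vector q (assumed given)
    vector_weights[q]  : weight of feature vector q (assumed given)
    classifiers[q]     : callable mapping scaled features to per-class scores
    Returns (predicted_class_index, fused_scores).
    """
    total = sum(vector_weights)
    fused = None
    for x, w, beta, classify in zip(feature_vectors, feature_weights,
                                    vector_weights, classifiers):
        scaled = [wi * xi for wi, xi in zip(w, x)]   # step 2: scale features
        scores = classify(scaled)                    # step 3: one prediction
        if fused is None:
            fused = [0.0] * len(scores)
        for c, s in enumerate(scores):               # step 4: weighted average
            fused[c] += beta * s / total
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused


# Hypothetical stand-ins: each "classifier" returns fixed scores for the
# three workload classes (1-back, 2-back, 3-back).
toy_classifiers = [
    lambda x: [0.6, 0.3, 0.1],   # e.g. PSD-based classifier
    lambda x: [0.2, 0.5, 0.3],   # e.g. ERP-based classifier
    lambda x: [0.1, 0.2, 0.7],   # e.g. ECG-based classifier
]
predicted, fused = fuse_decisions(
    feature_vectors=[[1.0, 2.0], [0.5], [3.0, 1.0, 2.0]],
    feature_weights=[[0.8, 0.0], [1.0], [0.3, 0.0, 0.6]],
    vector_weights=[0.5, 0.3, 0.2],
    classifiers=toy_classifiers,
)
print(predicted, fused)
```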

Comparison of Different Classifier Level Fusion Methods
Section 3.3 described the selection of the parameter λ in Equation (17) for each feature vector. As Equation (19) shows, IMIM-C estimates the weights of the features and the feature vectors simultaneously. Table 5 compares IMIM-C with other classifier level fusion methods based on k-NN (k = 1, 3) and SVM ($C = 10^{-1}, 10^{-3}$) classifiers.
Researchers have proposed different kinds of classifier level fusion methods. The Average method uses the average score of several classifiers to obtain the final prediction. Multi-kernel learning (MKL) methods such as VGGMKL estimate the weights of different kernels based on several feature vectors and use the weighted average kernel to train an SVM for classifier level fusion. Boosting methods treat each classifier based on a single feature vector as a weak classifier and combine several weak classifiers for better performance. LCDM improves the objective function of LP-B, which is a multi-class variant of LPBoost [51]. Table 5 compares the proposed method IMIM-C with the Average method, VGGMKL, and LCDM.
SVM and k-NN are used to compare the classifier level fusion methods. The parameters of the classifiers are varied to demonstrate the effectiveness of IMIM-C and to obtain better results. For k-NN, the number of nearest neighbors is selected from {1, 3}. For the soft margin SVM, the parameter C is chosen from {$10^{-3}$, $10^{-1}$}; the classification accuracy no longer changes when C is larger than $10^{-1}$. Table 5 presents the average accuracy and the standard deviation under the "leave-one-proband-out" cross-validation strategy. The results in the 1-back VS 2-back VS 3-back condition provide the most comprehensive analysis. In this condition, IMIM-C outperforms the other classifier level fusion methods for both SVM and k-NN classifiers, which indicates that IMIM-C is effective for heterogeneous bio-signal fusion. IMIM-C reaches its best performance based on SVM when $C = 10^{-3}$. IMIM-C is sensitive to the parameter C, and a small C is more suitable for avoiding over-fitting. For k-NN, IMIM-C performs better when k is larger because it can use more information in the classification process.
Table 5 also presents the ability of the classifier level fusion methods to distinguish the mental workload levels in the six conditions formed from the 1-, 2- and 3-back tasks. IMIM-C outperforms the other methods in each condition with each classifier, which supports the advantage of IMIM-C. Similar to the results of feature level fusion, all of the methods reach higher accuracy in the 1-back VS 2-back, 1-back VS 3-back and 1-back VS 2-, 3-back conditions. The mental workload induced by the 2- and 3-back tasks is clearly higher than that induced by the 1-back task. However, it is sometimes difficult to distinguish the mental workload levels of the 2-back and 3-back tasks. The results indicate that the features extracted from EEG and ECG signals differ between the lowest workload and the increased workload conditions, but that it is challenging to distinguish between different levels of high mental workload.
Figure 11 shows the classification accuracy for each subject based on the "leave-one-proband-out" cross-validation method. IMIM-C outperforms the other methods based on k-NN (k = 3) and SVM ($C = 10^{-3}$) (paired t-test, p < 0.01), which indicates that over-fitting is limited in the proposed method. IMIM-C is promising for subject-independent mental workload estimation.
Figure 11. Comparison of classifier level fusion methods according to 10 subjects involved in this experiment. The subjects are numbered from A to J randomly.
Table 6 compares IMIM with other methods developed mainly for mental workload estimation. EEG is indispensable for mental workload estimation, and some researchers have addressed this problem using only the EEG signal [31,37,52]. The fusion of EEG, ECG, and EOG has also been explored to improve the classification accuracy. Some researchers concatenated features from heterogeneous signals [28,30]. However, the accuracy did not increase noticeably because of the redundant features. Feature selection has been used before signal fusion to eliminate unnecessary features and improve the performance of the classifiers [17]. To use the information in the heterogeneous feature vectors, we estimate the feature weights based on interactive mutual information modeling (IMIM). IMIM-F reaches the highest precision in Table 6 at 91%, and IMIM-C also outperforms the other methods. IMIM improves the estimation of feature weights and supports both feature level and classifier level fusion. It maximizes the effective information contained in the various feature vectors and improves the performance with different classifiers.

Discussion
The physiological signals recorded during the 1-, 2- and 3-back tasks are used to validate IMIM for mental workload estimation. The preceding sections have presented the classification results and comparisons. Several issues remain to be discussed.
In the comparison of feature level fusion methods, IMIM-F reaches the highest precision compared with the Concatenation method, LFDM, and VGGMKL. The proposed method performs better with the fused feature vector than with a single feature vector. However, the other methods show no noticeable improvement when they combine heterogeneous signals. This indicates how difficult it is to use the dependency information and eliminate the redundancy information simultaneously. IMIM-F estimates the redundancy information by optimizing mutual information and combines the features according to the feature weights to use the dependency information. In contrast, the other methods do not correctly consider the interaction between any two features in the whole feature vector. The results of feature level fusion support the advantage of IMIM-F.
Classifier level fusion methods focus on the estimation of classifier weights. The performance of classifier level fusion is usually worse than that of feature level fusion because feature level fusion considers the interaction of all features and uses more information. Accordingly, all the signal fusion methods used in this study except IMIM perform worse in classifier level fusion than in feature level fusion. IMIM improves the performance in classifier level fusion significantly because it optimizes each feature vector. Equation (17) estimates the weight of each feature $w_p^{(q)}$ and the weight of each feature vector $\beta_q$ simultaneously. The redundancy information and the dependency information within one feature vector are taken into account through $w_p^{(q)}$. Therefore, IMIM-C refines each feature vector before classification, which clearly improves the classification accuracy.
This paper also compares IMIM with other research on mental workload estimation. EEG is the most important signal for mental workload estimation, and many research groups have proposed methods based only on EEG. In recent years, multi-modal biometric signal fusion has become a focus, and combining features from EEG, ECG and EOG increases the mental workload classification accuracy. However, these researchers simply concatenated different feature vectors, which may introduce redundancy information. As Table 6 shows, IMIM has the best performance among the compared mental workload estimation methods. The results imply that heterogeneous bio-signal fusion is promising for classifying different mental workload states, and that IMIM is effective in both feature level and classifier level fusion.

Conclusions
This paper focuses on mental workload estimation based on heterogeneous bio-signal fusion. High mental workload is harmful to human health and may cause particular people, such as pilots, soldiers, crew, and surgeons, to commit serious mistakes. Though EEG is the primary physiological signal that reflects workload, additional information from other physiological signals, such as ECG, can be used to improve the performance of the classifiers. This paper improves the objective function of mutual information and uses the $\ell_2$ norm to transform the non-convex problem into a convex problem. Interactive mutual information modeling (IMIM) is then proposed to improve mental workload estimation based on heterogeneous bio-signal fusion. IMIM extends the application of mutual information to estimating the weights of features and feature vectors for signal fusion. The n-back task is used to induce different mental workload states. The proposed method fuses EEG power spectral density (PSD) features, EEG event-related potential (ERP) features and ECG features based on the data collected in the 1-, 2- and 3-back tasks. The evaluation is threefold. First, IMIM-F (IMIM for feature level fusion) is compared with other feature level fusion methods. Second, IMIM-C (IMIM for classifier level fusion) is compared with other classifier level fusion methods, and it outperforms them with different classifiers. Third, IMIM is compared with previous research on mental workload estimation. Compared with recent studies, IMIM reaches the highest classification accuracy because it fuses heterogeneous feature vectors while accounting for redundancy and dependency information. IMIM effectively improves the classification accuracy and can be applied to monitoring mental workload. IMIM is also helpful for developing body sensor networks based on multi-modal physiological sensors.
This study proposes a feature-engineering method that can be applied to different classifiers, but it is validated only on the n-back task. Future work is to develop modern machine learning algorithms that generalize across different mental workload tasks.