Individual-Specific Classification of Mental Workload Levels via an Ensemble Heterogeneous Extreme Learning Machine for EEG Modeling

Abstract: In a human–machine cooperation system, assessing the mental workload (MW) of the human operator is crucial to maintaining safe operating conditions. Among various MW indicators, electroencephalography (EEG) signals are particularly attractive because of their high temporal resolution and sensitivity to the occupation of working memory. However, individual differences in the EEG feature distribution may impair a machine-learning-based MW classifier. In this paper, we employed a fast-training neural network, the extreme learning machine (ELM), as the basis for building an individual-specific classifier ensemble to recognize binary MW. To improve the diversity of the classification committee, heterogeneous member classifiers were adopted by fusing multiple ELMs and Bayesian models. Specifically, a deep network structure was applied in each weak model, aiming at finding informative EEG feature representations. The structure and hyper-parameters of the proposed heterogeneous ensemble ELM (HE-ELM) were then identified, and its performance was compared against several competitive MW classifiers. We found that the HE-ELM model was superior for improving the individual-specific accuracy of MW assessments.


Introduction
In various human-machine (HM) cooperation systems, such as driving systems, brain-computer interface (BCI) systems, nuclear power plants, and air traffic control [1], it is difficult for human operators to maintain effective functional states when performing long-duration tasks with a high complexity level. The reason is that human cognitive capacity is affected and limited by multidimensional psychophysiological factors, which leads to unstable task performance compared to machine agents. One of the most important aspects of operator functional states (OFSs) is termed mental workload (MW). It can be generally defined as the remaining cognitive resource or capacity of working memory under transient task demand [2].
Understanding the MW functionality can facilitate human-centered automation systems that improve both the safety and the satisfaction level of HM interaction. Aricò et al. [3] noted that the scope of BCI systems has been extended over the past decade, and their functionality now involves the monitoring of cognitive workload and emotional states identified from the user's spontaneous brain activity. Since the human executive capability is closely linked to transient task demand, this study employed the ACAMS software platform of a simulated spacecraft. This platform is used to simulate a safety-critical human-machine collaboration environment. Subjects need to concentrate on the operation because the trajectory can easily exceed the target range. Under the ACAMS, the ongoing physiological data of operators corresponding to different task complexities were acquired and used for modeling their transient MW level. The operation of the tasks was related to the maintenance of the air quality in a virtual spacecraft via human-computer interaction. Operators were required to manually monitor and maintain four variables (O2 concentration, air pressure, CO2 concentration, and temperature) within their respective ranges according to given instructions. That is, when any subsystem's program runs incorrectly, operators manually control the task until the system error is fixed. Simultaneously, the ongoing EEG signals were recorded. The complexity of the control tasks is measured by the number of manually controlled subsystems (NOS). As the NOS value changes, the complexity and demand of the manual control tasks gradually change, which can induce a variation of the MW.
Eight male participants (22-24 years), coded S1 to S8, were engaged. All participants were volunteer graduate students from the East China University of Science and Technology and had been trained in ACAMS operations for more than 10 h before the start of the formal experiment. As shown in our previous work [36], the average task performance of all subjects in the first control condition did not vary significantly compared with the last control condition (i.e., from 0.957 to 0.936). Moreover, no subject could perfectly operate the ACAMS at high task complexity in either session (average task performance of 0.783). Since the task demands in the high-difficulty conditions were the same across the two sessions, the habituation effect of the subjects was properly controlled: each participant was trained for more than 10 h before the formal experiment and their task performance had properly converged.
To rule out the effects of the circadian rhythm, each subject conducted every session between 2:00 p.m. and 5:00 p.m. The participants performed two identical sessions, each consisting of eight task-load conditions. After a starting 5-min baseline condition with an NOS value of 0, there were six 15-min conditions, followed by a final 5-min baseline condition. The duration of each session was thus 100 min (100 = 6 × 15 + 2 × 5). The operator's cognitive demand for manual tasks was gradually increased and then reduced within the 100 min of one session. We selected the EEG data from the six 15-min conditions in each session, yielding a total of 16 (2 × 8) physiological datasets. Under conditions #1-#6, the participants operated the ACAMS with NOS values of 2, 3, 4, 4, 3, and 2, respectively. The subjective rating analysis is omitted here and can be found in our previous work [33].

EEG Feature Extraction
The raw EEG signals and the extracted EEG features are depicted in Figure 2. The EEG data were measured via the 10-20 international electrode system at 11 scalp positions (i.e., F3, F4, Fz, C3, C4, Cz, P3, P4, Pz, O1, and O2) at a sampling rate of 500 Hz. Frontal theta power and parietal alpha power were used for MW variation detection. In addition, the EEG power of the alpha frequency band in the central scalp region decreases along with the increase of the MW according to [37-39]. In [40], the EEG power of the occipital channels was shown to be related to stress and fatigue. Therefore, in order to improve the diversity of the possible EEG features salient to workload variation, the central and occipital electrodes were also employed in the experiment. To cope with electromyography (EMG) noise, independent component analysis (ICA) was employed to filter the raw EEG data. The independent component associated with muscular activity was identified by careful visual inspection and eliminated before extracting EEG features. Moreover, the ACAMS operations did not heavily depend on the sensorimotor functionality of the cortex. If such noise exists, the corresponding EEG clues would be allocated small weight values, since the supervised machine learning algorithm only adopted MW levels as target classes. Therefore, the irrelevant EEG indicators can be well controlled.
The preprocessing steps of the acquired EEG signals are as follows:
(1) All EEG signals were filtered through a third-order low-pass IIR filter with a cutoff frequency of 40 Hz. Related works [41-43] indicated that removing EOG artifacts can improve the EEG classification rate; thus, the blink artifacts in the EEG signals were eliminated by the coherence method in this study.
According to our previous work [33], the blink artifact was removed by

d_c(t) = d(t) − C · d_O(t),    (1)

where the EEG signal and the synchronized EOG signal at time instant t are denoted by d(t) and d_O(t), respectively, and d_c(t) is the corrected EEG signal. The transfer coefficient C is defined by

C = [ Σ_{t=1}^{N} (d(t) − d̄)(d_O(t) − d̄_O) ] / [ Σ_{t=1}^{N} (d_O(t) − d̄_O)² ],    (2)

where N denotes the number of samples in an EEG segment, and d̄ and d̄_O are the means of the EEG signal and the synchronized EOG signal in a channel, respectively.
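As a minimal sketch, the regression-based blink removal above can be implemented as follows; the function and variable names are ours, not taken from the paper's code:

```python
import numpy as np

def remove_blink_artifact(eeg, eog):
    """Regression-based EOG removal for one EEG channel segment.

    eeg, eog: 1-D arrays of N synchronized samples (illustrative names).
    Returns the corrected signal d_c(t) = d(t) - C * d_O(t).
    """
    d = np.asarray(eeg, dtype=float)
    d_o = np.asarray(eog, dtype=float)
    # Transfer coefficient C: covariance of EEG and EOG over EOG variance.
    c = np.sum((d - d.mean()) * (d_o - d_o.mean())) / np.sum((d_o - d_o.mean()) ** 2)
    return d - c * d_o
```

When the recorded EEG is exactly a scaled copy of the EOG, the residual is zero, which is a quick sanity check for the coefficient estimate.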
(2) The filtered EEG was divided into 2-s segments and processed with a high-pass IIR filter (cutoff frequency of 1 Hz) to remove respiratory artifacts. The extracted EEG features are summarized in Table 1.
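The two filtering steps can be sketched in Python. The Butterworth filter family and zero-phase filtering via `filtfilt` are our assumptions, since the text only specifies third-order IIR filters and the cutoff frequencies:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500  # sampling rate in Hz, as stated in the text

def preprocess(eeg, fs=FS, seg_sec=2):
    """Low-pass at 40 Hz, cut into 2-s segments, high-pass each at 1 Hz.

    eeg: 1-D array for one channel. Returns a list of filtered segments.
    """
    b_lo, a_lo = butter(3, 40 / (fs / 2), btype="low")   # third-order IIR
    x = filtfilt(b_lo, a_lo, eeg)                        # zero-phase low-pass
    n = fs * seg_sec                                     # 1000 samples per segment
    segments = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
    b_hi, a_hi = butter(3, 1 / (fs / 2), btype="high")   # remove slow drift
    return [filtfilt(b_hi, a_hi, s) for s in segments]
```

A 6-s recording at 500 Hz thus yields three 2-s segments of 1000 samples each.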
Finally, there were 8 subjects in the experiment, each with two feature sets. Each feature set is a matrix of the same size: 1800 rows (data points) by 137 columns (features). That is, 28,800 data points are available in total. Each feature was normalized to a time course of zero mean and unit standard deviation. The EEG vectors were assigned MW labels of the low and high classes, quantified as y = [1 0] and y = [0 1], respectively. Note that the first and the last 450 EEG instances in each feature matrix represent the low MW level and the remaining 900 points correspond to the high MW level.
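A possible sketch of the per-feature normalization and label assignment described above; the function name and array layout are illustrative:

```python
import numpy as np

def normalize_and_label(features):
    """Z-score each feature column and attach binary MW labels.

    features: (1800, 137) matrix for one feature set (shape from the text).
    Per the text, the first 450 and last 450 rows are low MW ([1, 0])
    and the middle 900 rows are high MW ([0, 1]).
    """
    x = (features - features.mean(axis=0)) / features.std(axis=0)
    y = np.tile([0.0, 1.0], (len(features), 1))   # default: high MW class
    y[:450] = [1.0, 0.0]                          # first low-MW block
    y[-450:] = [1.0, 0.0]                         # last low-MW block
    return x, y
```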

Extreme Learning Machine
ELM is a fast-training modeling approach based on the single-hidden-layer feedforward network (SLFN) [44,45]. Figure 3 shows a typical SLFN-based ELM architecture, where x_i = [x_i1, x_i2, ..., x_iq]^T is the input sample array and the corresponding output label array is y_i = [y_i1, y_i2, ..., y_im]^T. Denote the input weights, output weights, and bias of the j-th hidden neuron as w_j = [w_j1, w_j2, ..., w_jq]^T, β_j = [β_j1, β_j2, ..., β_jm]^T, and b_j, respectively. Training an SLFN as an ELM is equivalent to minimizing the output error between the target output y_i and the model output ŷ_i with respect to the parameters β_j, b_j, and w_j, where the model output is

ŷ_i = Σ_{j=1}^{S} β_j p(w_j^T x_i + b_j), i = 1, 2, ..., N.    (3)

In Equation (3), N is the number of EEG samples and p(·) is the activation function. The numbers of input, output, and hidden neurons are defined as q, m, and S, respectively. The training cost function is formulated as a least-squares term,

E = Σ_{i=1}^{N} || ŷ_i − y_i ||².    (4)

Traditional neural network training approaches are mostly based on gradient-descent optimization via the BP algorithm. On the contrary, in ELM modeling the input weights and hidden-layer biases are randomly determined, while the output weights of the hidden layer are computed via a norm-minimization approach. Thus, it is unnecessary to tune the input weights and hidden-layer biases during training. Based on the randomly selected input weights and biases, minimizing the fitting error in Equation (4) reduces to solving the linear equation system Hβ = Y, where H is the output matrix of the hidden layer. The entry of H is

H_{ij} = p(w_j^T x_i + b_j), i = 1, 2, ..., N; j = 1, 2, ..., S.    (5)

Thus, the output weight matrix β can be obtained by the generalized-inverse operation

β = H* Y,    (6)

where H* denotes the Moore-Penrose generalized inverse of matrix H [27]. Singular value decomposition can be implemented to compute H*. The pseudo code of the ELM training algorithm is summarized in Table 2 [46].
In the pseudo code, the training set, activation function, number of hidden nodes, and output weights are denoted as {X, Y}, p(·), S, and β, respectively. The remaining parameters are consistent with the above.
1. Randomly generate the input weights w_j and biases b_j, j = 1, 2, ..., S.
2. Compute the hidden-layer output matrix H according to Equation (5).
3. Compute the generalized inverse H* of H.
4. Compute the output weights β = H*Y.
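The steps above can be sketched with NumPy's pseudo-inverse; the hard-limit activation and all helper names are our choices for illustration:

```python
import numpy as np

def train_elm(x, y, n_hidden=201, rng=None):
    """Minimal ELM: random input weights, pseudo-inverse readout.

    x: (N, q) inputs; y: (N, m) one-hot targets. The hard-limit
    activation and 201 hidden neurons mirror the best case reported
    later in the paper, but are assumptions here.
    """
    rng = np.random.default_rng(rng)
    q = x.shape[1]
    w = rng.uniform(-1.0, 1.0, size=(q, n_hidden))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # random hidden biases
    h = (x @ w + b > 0).astype(float)                # hardlim activation
    beta = np.linalg.pinv(h) @ y                     # Moore-Penrose solution
    return w, b, beta

def predict_elm(x, w, b, beta):
    h = (x @ w + b > 0).astype(float)
    return h @ beta
```

Because only β is solved for, training reduces to a single pseudo-inverse, which is the source of the ELM's speed advantage over BP training.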

Adaboost Based ELM Ensemble Classifier
To improve the classification accuracy of MW on the EEG datasets, we introduced an ELM classifier ensemble to capture the individual characteristics existing in the EEG features. The framework of the ELM ensemble is designed based on the adaptive boosting (Adaboost) algorithm [47]. In the classical boosting algorithm [48], each training instance is given an initial weight before the training procedure begins, and the value of the weight is automatically adjusted during each iteration. In particular, the Adaboost algorithm adds a new classifier in each training iteration, constructing several weak classifiers on the same sample pool. A strong classifier can then be integrated from those weak classifiers.
To implement the Adaboost method in the ELM ensemble, we first initialized the weights of the N training instances as w_i = 1/N, i = 1, 2, ..., N; that is, the initial sample weight distribution is D_1 = [w_1, w_2, ..., w_N]^T. Then, we ran M iterations (m = 1, 2, ..., M), in each of which a basic classifier G_m possessing the highest classification precision is selected. The error rate of the selected classifier on D_m is computed by

e_m = Σ_{i=1}^{N} w_i I(G_m(x_i) ≠ y_i),    (8)

where x_i denotes the input sample array, y_i the corresponding output label array, and I(·) indicates whether x_i is wrongly classified: I = 0 if x_i is correctly classified and I = 1 otherwise. The weight of G_m in the final strong classifier G_s is computed by

λ_m = (1/2) ln((1 − e_m) / e_m).    (9)

According to [47], the weight distribution of the training samples is updated via

w_i ← (w_i / Z_m) exp(−λ_m y_i G_m(x_i)), i = 1, 2, ..., N,    (10)

where Z_m is a normalization factor and the labels are coded as ±1. The output of the strong classifier G_s is integrated from the weak classifiers through the weights λ_m as

y_s = G_s(x) = sign( Σ_{m=1}^{M} λ_m G_m(x) ),    (11)

where y_s is the final output of the classifier. The pseudo code of the Adaboost ELM is presented in Table 3 [46], where the maximum number of iterations is M, the weight of each weak classifier is λ_m, and the output of the strong classifier is G_s(·). The remaining parameters are consistent with the above. Table 3. The pseudo codes for the Adaboost ELM algorithm.
1. Initialize the sample weight distribution D_1 = [1/N, ..., 1/N]^T.
2. For m = 1, 2, ..., M: train a weak ELM classifier G_m on the weighted training set; compute the predictive output G_m(x); compute the error rate e_m via Equation (8) and the classifier weight λ_m via Equation (9); update D_{m+1} according to Equation (10).
3. Output the strong classifier G_s via Equation (11).
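A compact sketch of the boosting loop, assuming ±1-coded labels and a generic weak-learner callback; in the paper the committee members are weighted ELMs, which the `train_weak` hook stands in for:

```python
import numpy as np

def adaboost(train_weak, x, y, m_rounds=10):
    """AdaBoost skeleton following Equations (8)-(11), labels in {-1, +1}.

    train_weak(x, y, d) -> predict: returns a fitted weak learner as a
    callable (hypothetical interface, not the paper's exact API).
    """
    n = len(x)
    d = np.full(n, 1.0 / n)                  # D_1: uniform sample weights
    learners, alphas = [], []
    for _ in range(m_rounds):
        g = train_weak(x, y, d)
        pred = g(x)
        e = np.sum(d * (pred != y))          # weighted error, Equation (8)
        if e == 0:                           # perfect weak learner: keep and stop
            learners.append(g); alphas.append(1.0)
            break
        if e >= 0.5:                         # no better than chance: stop
            break
        lam = 0.5 * np.log((1 - e) / e)      # classifier weight, Equation (9)
        d = d * np.exp(-lam * y * pred)      # reweight samples, Equation (10)
        d = d / d.sum()
        learners.append(g); alphas.append(lam)

    def strong(x_new):                       # weighted vote, Equation (11)
        votes = sum(a * g(x_new) for a, g in zip(alphas, learners))
        return np.sign(votes)
    return strong
```

Misclassified samples gain weight each round, so later weak learners concentrate on the hard instances.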

Heterogeneous Ensemble ELM
To further improve the generalization capability of the Adaboost ELM classifier on a specific participant, we adopted heterogeneous weak classifiers and the deep learning principle. The architecture of the proposed method termed as heterogeneous ensemble ELM (HE-ELM) is illustrated in Figure 4.

On one hand, the Naive Bayesian (NB) model was used as an alternative weak classifier to improve the diversity of the ensemble committee. The motivation lies in two aspects: (1) the input weights of all member ELMs are randomly drawn from a uniform distribution, which may lead to similar hidden-neuron properties across two weak classifiers; (2) the inference mechanism of the Bayesian model is inherently different from that of the classical ELM and thus facilitates a heterogeneous classifier ensemble. By setting different values of the class prior probability, we obtained diverse Bayesian models. For the ELMs, we implemented different activation functions and hidden-neuron numbers. To this end, Bayesian and ELM models with different hyper-parameters can produce a group of dissimilar decision boundaries. The overall error of the strong classifier can be reduced after integrating all heterogeneous models.
By denoting the maximum number of iterations as M, the whole ensemble process builds M weak classifiers, consisting of T NB models (denoted by G_t(·), 0 < t ≤ T) and K ELMs (denoted by G_k(·), 0 < k ≤ K) with T + K = M. According to Equation (3), the output of each ELM can be expressed as

ŷ = Σ_{j=1}^{S} β_j p(w_j^T x + b_j).    (12)

The prediction of the NB model is computed as

ŷ = argmax_k P(C_k) Π_j P(x_j | C_k),    (13)

where x and C_k denote the input instance and the label of class k, respectively. By incorporating y_s in Equation (11), the classification accuracy of the strong classifier G_s(·) generated in each iteration is defined as

acc(G_s) = 1 − U(G_s(X)) / N,    (14)

where U(·) measures the number of misclassified EEG data points among all N instances. Then, the member classifier G_m(·) in the m-th iteration is selected as the candidate maximizing the accuracy of the resulting strong classifier,

G_m(·) = argmax_{G ∈ {ELM, NB}} acc(G_s with G).    (15)

In addition, a deep network structure was applied in each member ELM and NB model, aimed at finding high-level EEG feature representations. We add a new abstraction layer to the member classifier, in which the network weights are trained by using locality preserving projection (LPP), which preserves the local geometrical properties of the EEG data distribution [49,50]. Denote the input weights as a transformation matrix A that maps an EEG feature vector x_i ∈ R^q to a feature abstraction vector f_i ∈ R^l, i.e.,

f_i = A^T x_i.    (16)

By creating an adjacency graph between x_i and x_j, the entry e_ij of the edge matrix E is computed using the Gaussian kernel with width parameter t > 0,

e_ij = exp(−||x_i − x_j||² / t).    (17)

The input weight matrix A of the deep ELM network can then be trained by solving the generalized eigenvalue problem of LPP,

X L X^T a = λ X D X^T a,    (18)

where D is the diagonal degree matrix with D_ii = Σ_j e_ij, L = D − E is the graph Laplacian, and the columns of A collect the eigenvectors a associated with the smallest eigenvalues. The output of each deep ELM classifier can thus be formulated as

ŷ = Σ_{j=1}^{S} β_j p(w_j^T A^T x + b_j).    (19)

For the NB model, the dimensionality of the EEG feature vectors is first reduced via Equation (16) and its output is computed as

ŷ = argmax_k P(C_k) Π_j P(f_j | C_k).    (20)

It is noted that the final strong classifiers are generated in the last iteration of each ensemble learning process on the training set from a single participant, i.e., participant-dependent classifiers are built for MW recognition.
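The LPP abstraction layer can be sketched via its generalized eigenproblem. The dense all-pairs graph and the small regularization term added for numerical stability are our simplifications:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(x, l=10, t=1.0):
    """Locality preserving projection of (N, q) features to l dimensions.

    Gaussian-kernel adjacency over all sample pairs; the projection
    matrix A collects generalized eigenvectors of X L X^T a = lam X D X^T a
    with the smallest eigenvalues.
    """
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    e = np.exp(-sq / t)                      # edge weights e_ij
    d = np.diag(e.sum(axis=1))               # degree matrix D
    lap = d - e                              # graph Laplacian L
    m1 = x.T @ lap @ x
    m2 = x.T @ d @ x + 1e-9 * np.eye(x.shape[1])  # regularize for stability
    vals, vecs = eigh(m1, m2)                # eigenvalues in ascending order
    a = vecs[:, :l]                          # projection matrix A, shape (q, l)
    return x @ a                             # f_i = A^T x_i for all samples
```

The projected features would then feed the member ELM or NB model in place of the raw 137-dimensional vectors.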
Table 4 lists the pseudo codes for the proposed HE-ELM algorithm [46], where the target dimensionality after reduction is D, the prior probability of the NB model is P_pri, the output of the strong classifier is Y_s, and the dimensionally reduced input matrix is X_L. Note that in the initial iteration the weak classifier G_1(·) is constructed by an ELM model. If the performance of the newly added classifier in the second iteration, i.e., another ELM G_2(·), is lower than that of the current strong classifier, we rebuild G_2(·) and compare it against a weak NB model according to Equation (20).
If G_2(·) achieves the highest performance, the next iteration is carried out. Each subsequent iteration repeats this computational process until the final strong classifier is generated.
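The iterative member selection just described can be sketched as a greedy loop. The candidate factories and the evaluation hook are hypothetical stand-ins, and the Adaboost sample reweighting is omitted for brevity:

```python
def grow_heterogeneous_ensemble(make_elm, make_nb, evaluate, m_max=20):
    """Greedy HE-ELM committee growth.

    make_elm / make_nb: callables returning a freshly trained weak
    classifier; evaluate(committee) -> accuracy of the strong classifier.
    All three are illustrative hooks, not the paper's exact interfaces.
    """
    committee = [make_elm()]                  # first member is always an ELM
    best = evaluate(committee)
    for _ in range(m_max - 1):
        improved = False
        for make in (make_elm, make_nb):      # try an ELM first, then an NB
            trial = committee + [make()]
            acc = evaluate(trial)
            if acc > best:
                committee, best, improved = trial, acc, True
                break
        if not improved:                      # strong classifier converged
            break
    return committee, best
```

The early-exit branch reproduces the behavior noted later for subject S6, where a single member already achieved the best accuracy.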

Results
To validate the proposed HE-ELM method for MW classification, two data-splitting paradigms were designed, as shown in Figure 5. In Case 1, the MW classifier was trained and tested on each participant's own EEG feature set, i.e., an individual-dependent paradigm. For each participant, the two-session EEG data of 3600 instances were divided into training and testing sets of 2400 and 1200 instances, respectively. For Case 2, the training and testing datasets were generated following an individual-independent principle. That is, all neurophysiological feature sets of the eight participants were integrated into a single database of 28,800 EEG instances. Then, the hold-out method was applied to determine two mutually exclusive sets with a training size of 19,200 and a testing size of 9,600 instances, respectively. All algorithms were trained and tested in Matlab® 2016a on an Intel Core i5-6200U CPU @ 1.30 GHz PC with the Windows 7® operating system and 4 GB of RAM. Matlab® 2016a is developed and supplied by MathWorks (Natick, MA, USA); the PC was produced by ASUS (Taipei, Taiwan). The ANN algorithm used for performance comparison was realized with the Matlab Neural Network Toolbox (Ver. 9.0). Figure 6 depicts the training and testing accuracy of the classical ELM classifier under Case 1 and Case 2 with different numbers of hidden neurons.
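A minimal hold-out split matching the sizes above can be sketched as follows; shuffling before splitting is our assumption, as the text does not state how instances were assigned:

```python
import numpy as np

def split_holdout(features, labels, train_frac=2 / 3, rng=0):
    """Hold-out split into mutually exclusive training and testing sets.

    For Case 1, features is one subject's 3600-instance matrix
    (-> 2400 / 1200); for Case 2, the pooled 28,800-instance database
    (-> 19,200 / 9,600). Both cases use the same 2/3 ratio.
    """
    n = len(features)
    idx = np.random.default_rng(rng).permutation(n)   # shuffle instance indices
    n_train = int(round(n * train_frac))
    tr, te = idx[:n_train], idx[n_train:]             # mutually exclusive sets
    return features[tr], labels[tr], features[te], labels[te]
```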
Three different activation functions of the ELM, i.e., the hard-limit, sigmoid, and sine functions, were investigated. In the training phase of both cases, we found that the sigmoid and the sine functions possess the fastest and the slowest convergence rates, respectively. The training accuracy of the ELM rises monotonically under all activation functions, and perfect training performance is achieved when a sufficient number of hidden neurons is adopted. The testing accuracy with the sine function achieved the lowest value in both cases, only slightly better than random guessing. The highest testing accuracy under the hard-limit function in Figure 6a-c outperforms that under the sigmoid activation function; in Figure 6d, the hard-limit function is only slightly better. This observation indicates that the effectiveness of the activation function for the ELM model is related to the size of the EEG feature set.
Figure 6. Training and testing accuracy of the ELM classifier under Case 1 (a-c) and Case 2 (d) vs. the number of hidden neurons. For Case 1, the performance of the ELM on the EEG feature sets from participants S1, S2, and S3 is presented. The labels hardlim, sigmoid, and sine indicate the hard-limit, sigmoid, and sine activation functions, respectively.

Model Selection for HE-ELM
The number of ELM hidden nodes giving the optimal performance for each participant is summarized in Table 5. We found that the optimal number of hidden nodes for training is much higher than that for testing. This implies that perfect training performance is achieved with complex network structures, while good generalization capability corresponds to a simpler structure. In Figure 7, the average computational cost for training and testing an ELM classifier is presented. Both the training and testing times are positively correlated with the number of hidden nodes. The time cost rises monotonically under all activation functions for both cases, and the testing time is much smaller than the training time. From the line plots, the computational burden under Case 1 is less than that under Case 2 because of the smaller sample size. The testing time under the hard-limit function achieves the lowest value in Figure 7b,d,f,h. There is no obvious difference between the sine and sigmoid functions in Figure 7b,d,f under Case 1. The testing time of the ELM classifier under the sine function achieves the highest value in Figure 7h, while the time cost across the three activation functions in the training stage does not vary significantly.
In order to select the optimal model structure for the proposed HE-ELM, we investigated its performance under different numbers of iterations in Table 6. The shallow ELM weak classifier was used to reduce the training time. For simplicity, the individual-dependent MW classifiers shared the same network structure. The optimal structure of the ELM weak classifier is shown in Table 7: the hardlim function with 201 hidden neurons is the best case. By comparing the two tables, it is observed that the ensemble ELM approach improves the accuracy by 1.43% over an optimally trained ELM.

Accuracy Comparison between HE-ELM and Different MW Classifiers
We implemented the proposed HE-ELM to generate participant-specific MW classification testing accuracy. In Figure 8, the performance of all subjects under HE-ELM is compared against the deep ELM and the classical ELM under optimal network structures. The histogram indicates that the HE-ELM and ELM classifiers achieve the highest and the lowest accuracy, respectively, and the performance improvement of the HE-ELM is particularly significant for subjects D and E. The number of iterations and the proportion of weak classifiers of each type are listed in Table 8. The number of iterations and the percentages of deep ELM and NB classifiers differ across the eight subjects, and the proportion of ELM is much larger than that of NB. In particular, for subject S6, only a single iteration was performed, because the first weak classifier had already achieved optimal classification performance and the Adaboost algorithm could not further improve the accuracy.
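The per-iteration choice between a deep ELM and an NB weak classifier, whose proportions Table 8 reports, can be illustrated by training each candidate on the weighted sample and keeping the one with the lowest weighted error. The sketch below uses scikit-learn's `GaussianNB` and a shallow decision tree as hypothetical stand-ins for the paper's two member types:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def pick_weak_classifier(X, y, w):
    """Train each heterogeneous candidate on the weighted sample and
    return the one with the lowest weighted error (illustrative
    stand-ins, not the paper's deep-ELM/NB implementations)."""
    candidates = [
        GaussianNB(),
        DecisionTreeClassifier(max_depth=3, random_state=0),
    ]
    best, best_err = None, np.inf
    for clf in candidates:
        clf.fit(X, y, sample_weight=w)
        # Weighted training error under the current Adaboost weights.
        err = np.sum(w * (clf.predict(X) != y)) / np.sum(w)
        if err < best_err:
            best, best_err = clf, err
    return best, best_err
```

Selecting per iteration, rather than fixing one model family, is what lets the committee stay heterogeneous while each member still targets the currently hard samples.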

Next, we compared the performance of the HE-ELM classifier against the classical ELM, K-nearest neighbor (KNN), artificial neural network with a single hidden layer (ANN), denoising autoencoder (DAE), logistic regression (LR), Adaboost based on the decision tree, stacked denoising autoencoder (SDAE), and NB MW prediction models. In addition, LPP-based feature mapping was adopted to produce eight deep-structured classifiers, denoted deep ELM, LPP-KNN, LPP-ANN, LPP-DAE, LPP-LR, LPP-AD, LPP-SDAE, and LPP-NB.
In Figure 9, we examine the performance of the classifiers used for comparison under different hyper-parameters on four representative subjects. The NB- and LR-based models have no hyper-parameter controlling model complexity and are therefore not analyzed in the figure. Figure 9 shows that the testing accuracy of the ELM, KNN, deep ELM, and LPP-KNN classifiers rises with model complexity. Specifically, the maximum testing accuracy of deep ELM is higher than that of ELM for all four subjects, and the optimal performance of LPP-KNN is better than that of KNN. Compared with ANN, LPP-ANN achieves better optimal performance, while the accuracy of the DAE model is similar to that of LPP-DAE.

The optimal testing classification accuracy of the 17 classifiers is illustrated in Figure 10 via box plots. Among all methods, HE-ELM and LPP-KNN achieve the best and the worst average accuracy, respectively. The classification performance of every MW classifier except KNN improves when LPP feature learning is used, which implies that the generalization capability of shallow classifiers can be enhanced via a layer of intermediate feature representation. The mean accuracy, precision, and recall of the MW classifiers on the eight subjects are shown in Table 9, which details the differences in classification performance across the 17 classifiers. The mean values of the three evaluation indicators for HE-ELM are all the highest, while the standard deviation of its accuracy is the lowest among all MW classifiers. In particular, LPP-KNN has the lowest mean accuracy, while the average recall of the NB model is the lowest.

The number of hidden-layer nodes, the k values for KNN, the number of weak classifiers, and the proportion of deep ELM classifiers for each subject are summarized in Table 11. The number of hidden neurons under the optimal classification of ELM and deep ELM was lower than that of ANN and LPP-ANN, respectively, whereas the DAE-based MW classifier possessed the simplest structure with the minimum number of hidden nodes. For LPP-AD, the average number of iterations was lower than for Adaboost. The numbers of neurons in the two hidden layers of LPP-SDAE are also reported in the table.

The results of the two-tailed test between HE-ELM and the other 16 MW classifiers are listed in Table 10. HE-ELM shows a significant improvement in accuracy compared with all other classifiers except LPP-SDAE. For precision and recall, HE-ELM is comparable with the six deep-structure classifiers generated via LPP-based feature mapping, except for LPP-KNN and LPP-ANN. In general, the accuracy of the HE-ELM is superior to most pattern classifiers and is comparable with the advanced deep learning method.
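The accuracy, precision, and recall indicators used throughout this comparison can be computed from the binary confusion counts; a minimal sketch (assuming label 1 denotes the high MW level):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary MW labels (1 = high MW)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Averaging these three values over the eight subject-specific test sets yields the per-classifier rows of Table 9.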

Computational Cost Comparison
The average CPU time for training and testing an MW classifier on the EEG feature set of a single subject is shown in Table 12. LPP-NB and Adaboost achieved the lowest and the highest computational burden, respectively. Adaboost had the highest cost mainly because it required a large number of iterations on this dataset. The main reason for the high computational overhead of HE-ELM is that the method tries to select a suitable weak classifier in each iteration, where each weak classifier can be either an ELM- or an NB-based classification model. Thus, the shortcoming of HE-ELM is its high training cost, although it effectively improves the MW classification accuracy compared with the classical ELM algorithm. For the other classifiers, using LPP for dimensionality reduction can significantly reduce the computational cost. A high computational burden is observed for the machine-learning approaches that use gradient descent optimization, e.g., DAE, SDAE, and ANN.
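The train/test timing protocol behind Table 12 can be reproduced with a simple wall-clock wrapper; this is a generic sketch of the measurement, not the paper's benchmarking code:

```python
import time

def time_fit_predict(clf, X_train, y_train, X_test):
    """Measure wall-clock training and testing time of any classifier
    exposing fit/predict (generic timing sketch)."""
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    clf.predict(X_test)
    test_time = time.perf_counter() - t0
    return train_time, test_time
```

Repeating the measurement over several runs and averaging reduces the influence of operating-system jitter on the reported CPU times.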

Visualization of the Intermediate EEG Feature Representations
The intermediate EEG features abstracted in the hidden layers of the deep ELM within HE-ELM are shown in Figure 11. To facilitate visualizing the abstraction distribution, three representative hidden variables from the EEG datasets of two subjects were selected and illustrated in 3-D scatter plots. The abstraction vectors corresponding to low and high MW levels can be clearly distinguished once the EEG features are properly mapped via the hidden units of the weak classifier in HE-ELM. In the first hidden layer, the 137 EEG features (the first three features are shown in Figure 11a,d) were fused into 27 hidden variables (the first three variables are plotted in Figure 11b,e). The first abstraction layer of the deep ELM network was constructed via the LPP method. The abstraction vectors in Figure 11c,f are more concentrated than those in Figure 11a,d, which implies that the deep architecture is helpful for salient information fusion when EEG features from multiple domains are available.
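A scatter plot of the kind shown in Figure 11 can be produced by selecting three hidden variables and coloring the points by MW level; the helper below is an illustrative sketch (the function name and arguments are assumptions, not the paper's plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_hidden_abstractions(H, labels, dims=(0, 1, 2)):
    """3-D scatter of three hidden-layer variables, colored by MW level.

    H: hidden activation matrix, one row per EEG epoch.
    labels: binary MW labels (0 = low, 1 = high).
    """
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    for level, color in [(0, "tab:blue"), (1, "tab:red")]:
        pts = H[labels == level][:, list(dims)]
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=color,
                   label="low MW" if level == 0 else "high MW")
    ax.legend()
    return ax
```

Applying this to the activations of successive layers visualizes how the class clusters tighten as the representation deepens.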

Discussion
The proposed HE-ELM for MW assessment integrates ELMs with classical NB models to form a heterogeneous strong classifier. To ensure the effectiveness of the ensemble committee, the hyper-parameters of each weak classifier were set to be different. On the premise of ensuring model diversity, the boosting iterations were performed until the highest classification performance was reached. To find the optimal structure of the HE-ELM, we first implemented Adaboost-ELM and observed that the ELM ensemble outperformed the classical ELM when a moderate number of iterations was used. However, the performance improvement was marginal because the distribution of the EEG features showed large individual differences. Therefore, we implemented a deep ELM network structure to find stable EEG feature representations, combined with an individual-specific (or subject-dependent) classification paradigm.
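The LPP-based feature mapping underlying the deep ELM representation can be sketched as follows: a linear projection is learned that keeps neighboring EEG feature vectors close in the reduced space. This is a generic Locality Preserving Projections implementation with illustrative parameter values, not the paper's exact settings:

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_components=27, n_neighbors=5, t=1.0):
    """Locality Preserving Projections sketch.

    Returns a (features x n_components) projection matrix; the default
    of 27 components mirrors the first-layer width discussed above,
    but all parameter values here are illustrative.
    """
    n = X.shape[0]
    # Heat-kernel affinity restricted to the k nearest neighbors.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / t)
    nn = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    mask[rows, nn.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                               # graph Laplacian
    # Generalized eigenproblem X'LX a = lambda X'DX a; keep the
    # directions with the smallest eigenvalues.
    A, B = X.T @ L @ X, X.T @ D @ X
    B = B + 1e-6 * np.eye(B.shape[0])       # regularize for stability
    vals, vecs = eigh(A, B)
    return vecs[:, :n_components]
```

The projected features `X @ lpp(X)` would then serve as the intermediate representation fed to the ELM weak classifier.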
According to the existing literature on machine-learning-based EEG classification, a group of effective classifiers has been validated, including NB, ANN, the KNN algorithm, DAE, and LR. We compared HE-ELM against these approaches, and the former achieved the highest performance. Several observations were made when tackling high-dimensional EEG features. The KNN method easily over-fits when the k value is set too small [51]; on the other hand, its computational complexity increases with a large k. The performance of an ANN with a single hidden layer is unstable because its shallow structure is incapable of fusing noisy EEG features. By corrupting the training samples with noise and training the network to reconstruct the original input [52], DAE achieves better performance than ANN. The LR model maps the original output to the interval (0, 1) via the logistic sigmoid function [53]. However, the lack of intermediate feature representation limits the adaptability of DAE and LR on individual-dependent EEG feature sets.

In the literature, Ayaz et al. measured operator MW in an unmanned aerial vehicle control task; based on one-way repeated-measures ANOVA of fNIRS data, significant differences between the MW indices corresponding to different task demands were found [16]. In our previous work, RBF-kernel and linear-kernel SVMs were utilized for assessing binary MW [54], and the average classification accuracy ranged from 0.7912 to 0.9317. An adaptive stacked denoising autoencoder (A-SDAE) was applied to the binary MW classification problem in [21]; with an EEG feature dimensionality of 55, the mean accuracy on seven subjects was 0.8579. In the current work, the average accuracy of the HE-ELM algorithm on all eight subjects was 0.9384, which partially supports the benefit of the ensemble learning principle. However, because the training and testing environments differed across these studies, a strict performance comparison is difficult.
Although the HE-ELM improves on the standard ELM as well as the LPP-based deep ELM, its main limitation is that it was evaluated as an offline simulation and was not tested in an online fashion; the proposed algorithm requires the entire dataset and cannot be scheduled in real time. Moreover, the training time cost is significantly higher than that of single, shallow machine-learning models, which may impair the effectiveness of retraining the MW classifier on EEG data from a novel operator of the human-machine system. When the sample size is large, the number of iterations required by HE-ELM increases, since more suitable weak classifiers need to be generated. However, the computational cost of HE-ELM is mainly incurred in the training stage; when the trained model is used for testing unseen EEG feature vectors, the required time remains comparable to that of shallow MW classifiers. It is also noted that HE-ELM introduces an additional hyper-parameter, the optimal number of iterations, and Table 11 shows that this value affects the accuracy of the individual-specific MW classifier ensemble. A further technical challenge is finding appropriate NB models to iterate with the ELM models so as to effectively improve classification performance; in many cases, a selected NB model may reduce the performance of the final strong classifier. Another possible limitation is that the size of the training data may be insufficient to achieve stable online performance, so the reported accuracy cannot be guaranteed in online, real-world application developments.
In future work, the online MW recognition capability of the proposed method should be investigated and validated. It is also promising to study whether a generic MW classifier ensemble can be found for all individuals in an online fashion. fNIRS is an optical brain-monitoring technique that has been used to evaluate MW in ecologically valid environments [16]; its merits are low invasiveness for data acquisition and high reliability. However, current fNIRS devices are not suitable for monitoring all cortical areas, and the temporal resolution of the acquired data needs to be improved. A BCI-based MW assessment system that fuses the EEG and fNIRS modalities is a promising solution; such a hybrid system has been documented by Babiloni et al. [55]. Ensemble-learning-based approaches such as our HE-ELM may be helpful for fusing features from the two domains. Possible obstacles include the synchronization of the EEG and fNIRS features, the higher computational burden, and the different sampling frequencies of the two modalities when online MW assessment is implemented.

Conclusions
In this paper, we presented a new machine-learning-based OFS evaluator, HE-ELM, to classify EEG features into binary MW levels. Under an individual-specific modeling paradigm, the HE-ELM was constructed from heterogeneous deep ELM and NB weak classifiers. Two types of member classifier structure were employed to enhance the diversity of the classification committee generated via the Adaboost algorithm. To find proper hyper-parameters for the proposed ensemble model, the number of hidden nodes and the optimal activation function were identified. To validate the effectiveness of HE-ELM, we built training and testing sets from eight EEG feature sets via the hold-out method, and classical single, shallow MW classifiers were employed for comparison in terms of classification accuracy and computational cost. We found the HE-ELM model superior for improving the individual-specific accuracy of MW assessments, although at the cost of a higher training burden. The HE-ELM algorithm in this paper was evaluated in offline simulation; combining it with fNIRS technology may help address this limitation in the future, although such a combined EEG-fNIRS hybrid system may bring disadvantages such as increased computational cost. In general, the combined EEG and fNIRS hybrid system is worth exploring for MW assessment in future work.