Article

Multimodal Machine Learning Framework for Driver Mental Workload Classification: A Comparative and Interpretable Approach

1 School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
2 The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3581; https://doi.org/10.3390/app16073581
Submission received: 31 July 2025 / Revised: 13 September 2025 / Accepted: 14 September 2025 / Published: 7 April 2026

Abstract

Understanding and monitoring driver mental workload is essential for improving road safety. This study proposes a multimodal machine learning framework to classify drivers’ mental workload using eye movement metrics, physiological signals, and driving behavior features. A driving simulator experiment was conducted with 26 participants under two workload levels induced by a secondary auditory task. Seven feature combinations and six classification algorithms were evaluated. The results showed that eye metrics were the most informative modality, and that feature selection had a greater impact on classification performance than algorithm choice. A support vector machine with optimized features was selected as the final model based on performance and stability, achieving an accuracy of 87.8% and an AUC of 0.95. To improve model transparency, SHapley Additive exPlanations (SHAP) was applied, highlighting key predictors such as blink rate and heart rate, and uncovering synergistic effects between visual and physiological variables. The model was further validated in a tunnel entrance scenario, where it identified increased workload associated with steeper longitudinal slopes. These findings emphasize the importance of multimodal data integration—particularly eye movements—for assessing mental workload. Future applications should prioritize feature diversity over algorithm complexity to enhance real-world implementation in workload monitoring systems.

1. Introduction

Mental workload is a widely studied concept in human factors engineering and ergonomics, often used to describe the relationship between task demands and an individual’s cognitive state [1]. In the driving context, “task demand” refers to the effort required to maintain safe vehicle operation, while the “individual state” pertains to the driver’s perception of risks and control over the vehicle. Both factors continuously adapt in response to varying driving tasks (e.g., overtaking, phone-answering), thereby influencing mental workload levels [2,3]. Although drivers may not be immediately aware of subtle shifts in their mental workload, the cumulative effect of fluctuations—ranging from underload to overload—can impair information processing and reaction times, ultimately compromising road safety [4]. Consequently, the timely and accurate evaluation, and even prediction, of driver mental workload is essential for maintaining safe driving performance.
Traditional methods for evaluating mental workload include subjective ratings, task performance measures, and physiological indicators [1]. Subjective ratings capture a driver’s personal perceptions and can be categorized into real-time and post-task assessments. Real-time assessments record the driver’s feelings during the task, offering timely insights but potentially interfering with driving performance [5]. In contrast, post-task assessments—typically administered through questionnaires—avoid task interference but rely on the participant’s memory and may introduce recall bias [6]. This trade-off between the intrusiveness of real-time evaluation and the retrospective bias of post-task assessment raises concerns about the reliability of subjective methods [7]. Therefore, incorporating objective measures is essential for improving the accuracy and validity of mental workload evaluation.
Among objective measures, task performance metrics are commonly used and can be divided into primary and dual-task performance. Primary task performance refers to direct assessments of driving behavior, such as lane-keeping (e.g., lateral deviation) and reaction time. Dual-task performance, on the other hand, involves the introduction of a secondary task during the driving experiment to indirectly evaluate mental workload based on the driver’s performance on that secondary task. These secondary tasks—commonly referred to as non-driving-related tasks (NDRTs)—can take various forms, including auditory, visual, or memory-based tasks [2,8,9]. Superior performance on the secondary task suggests greater spare cognitive capacity and, by extension, a lower level of mental workload. Another widely used objective approach is physiological measurement, which captures mental workload by continuously recording changes in the driver’s physiological state with high temporal resolution. Common indicators include cardiac activity, brain activity, eye movements, and galvanic skin response, all of which have been frequently employed in studies assessing driver mental workload [10,11,12]. Recent reviews have further highlighted the role of multimodal physiological sensing in providing reliable and real-time workload estimation across domains, while also noting challenges such as standardization and generalization [13].
The aforementioned methods provide a solid foundation for assessing driver mental workload, but they are often limited in their ability to achieve accurate quantification and real-time prediction when used in isolation. Recent research has increasingly focused on applying machine learning techniques to enhance the evaluation of mental workload by integrating subjective assessments, task performance metrics, and physiological indicators. This multimodal approach enables more precise and scalable workload classification. Table 1 summarizes recent studies on machine learning-based driver mental workload assessment, highlighting key aspects such as feature selection and classification algorithms.
As shown in Table 1, studies vary in their emphasis on different feature categories during the development of mental workload classification models. Some studies [16,18,19] focus exclusively on physiological signals, viewing them as the most direct indicators of mental workload. In contrast, other studies suggest that combining physiological data with performance metrics or subjective ratings yields a more comprehensive understanding [14,15,17,20,21]. Although the importance of physiological features is widely acknowledged, feature selection remains subjective and highly context-dependent, often shaped by specific research goals and driving scenarios. This underscores the need for further investigation into the relative contribution of each feature category to classification accuracy. Algorithm choice also varies considerably across studies. While some researchers opt for simple and interpretable models such as logistic regression [14], others employ more complex approaches like Recurrent Neural Network [16,19]. However, there is still no consensus on which machine learning algorithm consistently delivers the best classification performance [22].
Therefore, this study aims to clarify the relative importance of feature selection and algorithm choice in developing robust and accurate driver mental workload classification models. Specifically, the following research questions are addressed:
a. How should features be selected and balanced when constructing driver mental workload models?
b. Do different algorithm choices lead to significant differences in model performance?
c. Which factor—feature selection or algorithm choice—has a greater influence on model performance?
To systematically compare the effects of different feature sets and algorithms, a driving simulator experiment was conducted under two workload conditions. Data were collected on driving performance, eye movements, and physiological signals, and categorized into seven feature groups. Six machine learning algorithms were then applied to construct the classification models. The best-performing model was interpreted using SHAP and further validated in a tunnel-driving case study involving variations in road slope and lighting conditions. In doing so, the study is expected to contribute new insights into the relative importance of feature design compared with algorithm selection in workload modeling.

2. Materials and Methods

2.1. Participants

A total of 35 participants were recruited for this driving simulator study through online forums and designated driving service platforms. Due to synchronization issues and abnormal signal amplitudes in the physiological data, the recordings from 9 participants were excluded. The final sample consisted of 26 participants (5 females, 21 males), with a mean age of 28.3 ± 6.76 years and an average driving experience of 6.88 ± 6.11 years. All participants were in good physical and mental health, and no cases of simulation sickness were reported during the experiment.

2.2. Apparatus

As illustrated in Figure 1, the experiment was conducted using a 3-degree-of-freedom motion-based driving simulator, which recorded driving behavior at a sampling rate of 50 Hz. The simulator’s visual system included a circular screen providing a 180-degree forward field of view, supplemented by three liquid crystal displays for the rearview mirrors. Eye-tracking data were collected using the Dikablis Glasses 3 (Ergoneers GmbH, Egling, Germany), operating at 60 Hz. Physiological signals, including electrocardiogram (ECG) and skin conductance level (SCL), were recorded using a PhysioLab wireless device (Ergoneers GmbH, Egling, Germany) at a sampling rate of 1000 Hz. All data streams were synchronized and recorded in real time using D-Lab software (Version 3.0) to ensure precise temporal alignment across modalities.

2.3. Experiment Design

In this study, participants drove along simulated rural roads while their driving performance, eye movements, and physiological signals were continuously recorded. The rural road environment consisted of open road segments and tunnel sections. A secondary task was introduced during the open road segments to impose additional cognitive workload. The tunnel segments were designed to examine driver behavior under varying longitudinal slopes and lighting conditions, as detailed in a previous study [23]. The present study focuses on data collected from the open road segments to investigate driver mental workload, and further utilizes the tunnel sections for workload prediction. Tunnels are widely recognized as safety-critical road environments where reduced visibility, constrained geometry, and changes in slope and illumination can naturally increase drivers’ cognitive demand. Therefore, incorporating tunnel scenarios into the analysis provides a representative setting for workload prediction. In this experiment, each tunnel segment was 1000 m in length, ensuring sufficient exposure for drivers to adapt to and respond under these conditions.
The open roads featured 3.75 m wide lanes in each direction and served as connectors between tunnels. Two road segment lengths were used: 900 m and 2000 m. In the 900 m segments, no secondary task was administered. In the 2000 m segments, an auditory 2-back memory task was employed to induce cognitive workload [24]. Participants listened to a sequence of digits (0–9) presented at 2.25 s intervals through a pre-programmed system. Starting from the third digit, they were required to verbally recall the digit that appeared two positions earlier in the sequence.
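The 2-back protocol described above can be illustrated with a minimal sketch (purely illustrative; the experiment used a pre-programmed auditory presentation system, not this code). Given a digit sequence, the expected verbal response from the third digit onward is the digit presented two positions earlier:

```python
import random

def two_back_trials(n_digits, seed=0):
    """Generate (presented_digit, expected_response) pairs for an
    auditory 2-back task with digits 0-9. For the first two digits no
    response is expected (None); from the third digit onward the correct
    response is the digit presented two positions earlier."""
    rng = random.Random(seed)
    digits = [rng.randint(0, 9) for _ in range(n_digits)]
    trials = []
    for i, d in enumerate(digits):
        expected = digits[i - 2] if i >= 2 else None  # no response yet
        trials.append((d, expected))
    return trials
```

In the experiment, each digit was presented at 2.25 s intervals, so a 2000 m segment driven near the 80 km/h limit accommodates a sequence of roughly 40 digits.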
The driving scenarios were designed to primarily involve operational-level driving. In both conditions, participants were instructed to remain in the left lane and drive naturally, adhering to the posted speed limit of 80 km/h and following their habitual driving style. Each participant completed four driving sessions, with each session including two secondary task scenarios. An overview of the experimental design for each drive is presented in Figure 2.

2.4. Procedure

Upon arrival at the laboratory, participants were given a brief orientation to ensure a clear understanding of the experimental procedure. They were then asked to complete a questionnaire collecting demographic and driving-related information, followed by signing an informed consent form. According to the institutional and national regulations for non-invasive behavioral experiments (e.g., driving simulator studies), ethics committee approval was not required for this study. Next, participants underwent a training session to familiarize themselves with both the secondary task and the driving simulator. Once participants demonstrated sufficient comfort and proficiency with the simulator, they were equipped with eye-tracking glasses and physiological sensors in preparation for the experimental drives. During the experimental sessions, participants were instructed to drive in accordance with their usual habits and could take breaks between scenarios if requested. At the end of the experiment, each participant received 200 RMB as a reimbursement.

2.5. Classification Models

All datasets were labeled according to two workload levels: normal workload (without a secondary task) and elevated workload (with a secondary task), thereby framing the classification task as a binary problem. Drawing on insights from related studies (as summarized in Table 1) and taking into account the dataset size, six representative machine learning algorithms were selected for model development: Logistic Regression, Naive Bayes, K-Nearest Neighbors, Random Forest, Support Vector Machine, and XGBoost. These algorithms encompass both traditional and advanced methods, offering a diverse range of capabilities suited to various data characteristics and model complexities.

2.5.1. Logistic Regression

Logistic Regression is a linear model commonly used for binary classification tasks [14]. It estimates the probability of class membership by applying a logistic (sigmoid) function to a weighted sum of the input features. Model training involves optimizing a log-likelihood function, typically using methods such as gradient descent. Logistic Regression is particularly suitable for scenarios where the relationship between the input variables and the output is assumed to be linear on the log-odds scale, making it a widely adopted baseline in classification problems.

2.5.2. Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes’ theorem, which assumes conditional independence among input features given the class label [25]. It computes the posterior probability of each class by combining prior probabilities with the likelihood of observed features, and assigns the instance to the class with the highest posterior probability. Due to its simplicity, scalability, and effectiveness with high-dimensional data, Naive Bayes is widely used in both binary and multiclass classification tasks.

2.5.3. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that classifies samples based on similarity in the feature space [26]. For a given input, the algorithm identifies the K nearest neighbors—typically using Euclidean distance—and assigns the majority class among those neighbors as the predicted label. Unlike other algorithms, KNN does not involve an explicit training phase; instead, it relies on the entire dataset during prediction. KNN is versatile and can be applied to both classification and regression tasks, though its performance is sensitive to the choice of K and the distance metric.

2.5.4. Random Forest

Random Forest is an ensemble learning algorithm that builds a collection of decision trees using bootstrap samples of the training data [27]. To enhance model diversity, a random subset of features is considered at each split within each tree. For classification tasks, the final prediction is determined through majority voting across all trees. Random Forest is well-suited for handling high-dimensional data and is effective at capturing non-linear relationships and complex feature interactions, while also being relatively robust to overfitting.

2.5.5. Support Vector Machine (SVM)

Support Vector Machine (SVM) is a supervised learning algorithm designed to identify the optimal hyperplane that maximally separates data points from two classes [28]. In the case of linearly separable data, SVM maximizes the margin between the support vectors and the decision boundary to improve generalization. For non-linear classification problems, SVM employs kernel functions $K(x_i, x_j)$ to implicitly map the input data into a higher-dimensional feature space, where a linear separation may become feasible. This flexibility allows SVM to effectively handle both linear and non-linear classification tasks. The SVM optimization objective is:
$$\min_{\mathbf{w},\, b,\, \xi_i} \quad \frac{1}{2}\lVert\mathbf{w}\rVert^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad y_i\!\left(\mathbf{w}^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, N$$
where $\phi(x_i)$ is the mapping function defined by the kernel, $\xi_i$ are slack variables allowing misclassification, and $C$ is the penalty parameter controlling the trade-off between maximizing the margin and minimizing the classification error.
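To make the objective concrete, the following minimal sketch (illustrative only, not the authors' implementation) evaluates the primal objective for a linear kernel, using the fact that at the optimum each slack variable equals the hinge loss, $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}^T x_i + b))$:

```python
def svm_primal_objective(w, b, C, X, y):
    """Evaluate the soft-margin SVM objective
    0.5*||w||^2 + C*sum(xi_i), where each slack variable
    xi_i = max(0, 1 - y_i*(w.x_i + b)) is the hinge loss of sample i
    (labels y_i in {-1, +1}, linear kernel for simplicity)."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    slack = [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
             for xi, yi in zip(X, y)]
    return margin_term + C * sum(slack)
```

A larger $C$ penalizes slack more heavily, trading a wider margin for fewer training misclassifications.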

2.5.6. XGBoost

XGBoost (Extreme Gradient Boosting) is an optimized implementation of the gradient boosting framework, specifically designed for decision tree-based models [29]. It constructs additive models in a sequential manner, where each new tree is trained to minimize a regularized objective function comprising both a loss term and a model complexity penalty. XGBoost leverages a second-order Taylor expansion to approximate the loss function during optimization and incorporates advanced techniques such as column subsampling and learning rate shrinkage to enhance generalization and reduce overfitting. It is widely recognized for its efficiency, scalability, and high predictive accuracy, making it a popular choice in many classification and regression tasks.
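For reference, the second-order approximation described above takes the following form in the original XGBoost formulation [29] (notation follows that paper):

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}
```

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$, $T$ is the number of leaves of the new tree $f_t$, and $w$ denotes its leaf weights.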

2.6. SHAP Explanation Model

SHAP (SHapley Additive exPlanations) is an additive feature attribution method rooted in cooperative game theory. It provides consistent and locally accurate explanations by computing the contribution of each input feature to a model’s output. Unlike traditional feature importance rankings, SHAP quantifies both the magnitude and direction of each feature’s impact on the prediction. The SHAP explanation model expresses the model output $f(x)$ as a linear sum of feature contributions [30]:
$$f(x) = g(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$$
In Equation (2), $f(x)$ denotes the predicted value for input $x$, $g(x)$ is the SHAP explanation model, $\phi_0$ is the model output for a baseline input (usually the mean prediction over the training data), and $\phi_i$ represents the contribution of feature $i$ to the deviation from the baseline. For a given feature $i$, its SHAP value $\phi_i$ is computed by averaging its marginal contributions across all possible feature subsets $S \subseteq N \setminus \{i\}$:
$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ f_{S \cup \{i\}}(x) - f_S(x) \right]$$
Here, $N$ is the full set of features, and $f_S(x)$ is the expected model output when only the features in subset $S$ are known. The term $\frac{|S|!\,(M - |S| - 1)!}{M!}$ serves as a weighting factor to fairly account for all possible permutations in which feature $i$ could be added to the subset $S$.
SHAP values offer a theoretically grounded approach to interpreting model predictions by considering all possible combinations of feature presence and absence. For each instance, the SHAP value of a feature quantifies its contribution to the model output, averaged across all possible feature subsets. This framework enables consistent and locally accurate explanations for both linear and complex non-linear models. In this study, the KernelSHAP method was employed to estimate SHAP values for the support vector machine model. KernelSHAP is a model-agnostic approach that approximates SHAP values through weighted linear regression over a set of randomly sampled feature coalitions, making it suitable for interpreting black-box classifiers, including non-tree-based models [31].
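Equation (3) can be made concrete with a brute-force implementation that enumerates every feature subset. The sketch below (illustrative only; the study itself used KernelSHAP, which approximates this sum by sampling) realizes $f_S(x)$ by fixing features outside $S$ at a baseline value, one common convention:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets, per
    Eq. (3). f_S(x) is approximated by evaluating the model with
    features outside S replaced by a baseline input. Cost grows as
    2^M, so this is feasible only for small feature counts M."""
    M = len(x)
    features = list(range(M))

    def f(subset):
        # Features in `subset` come from x; all others from the baseline.
        z = [x[j] if j in subset else baseline[j] for j in features]
        return model(z)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        value = 0.0
        for size in range(M):
            for S in combinations(others, size):
                # Weighting factor |S|!(M-|S|-1)!/M! from Eq. (3)
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                value += weight * (f(set(S) | {i}) - f(set(S)))
        phi.append(value)
    return phi
```

By the additivity property of Equation (2), the values sum to the difference between the prediction for $x$ and the baseline prediction $\phi_0$.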

3. Model Development and Results

3.1. Data Pre-Processing and Feature Generation

Driving performance, eye movements, and physiological signals were collected during the experiment to capture multiple dimensions of driver mental workload. From the driving performance data, six features were generated: mean and standard deviation of speed, mean and standard deviation of lateral position, average position of the gas pedal, and maximum position of the brake pedal. These features were selected because variations in speed regulation, lateral control, and pedal input have been consistently linked to changes in driver workload levels [32,33].
From the eye movement data, two features were extracted: blink rate and the standard deviation of horizontal gaze position. Blink rate is widely considered a reliable indicator of visual and cognitive workload, often increasing as task demands rise [34]. Horizontal gaze variability captures how frequently and broadly drivers scan their surroundings, with reduced variability potentially signaling elevated mental workload [35].
From the physiological signals, four features were derived: heart rate, heart rate growth rate, the standard deviation of inter-beat intervals (SDNN), and the standard deviation of skin conductance level (SCL). These metrics were selected based on their established association with sympathetic nervous system activation and emotional arousal, both of which are closely linked to mental workload [20,36].
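Two of these cardiac features follow directly from the series of inter-beat (RR) intervals; the helper below is a minimal sketch (function name and inputs hypothetical, not from the study's processing pipeline):

```python
from statistics import mean, stdev

def heart_features(rr_intervals_ms):
    """Derive mean heart rate (bpm) and SDNN (ms) from a series of
    inter-beat (RR) intervals in milliseconds. SDNN is the standard
    deviation of the intervals, a common heart-rate-variability metric."""
    hr = 60_000.0 / mean(rr_intervals_ms)  # beats per minute
    sdnn = stdev(rr_intervals_ms)          # std. dev. of RR intervals
    return hr, sdnn
```

Lower SDNN under the secondary task would be consistent with reduced heart-rate variability during sympathetic activation.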
To minimize potential multicollinearity among the 12 extracted features, Pearson’s correlation analysis was conducted. The results indicated that all pairwise correlation coefficients were below 0.65, suggesting acceptable independence among features.
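This screening step can be sketched as follows (feature names and values hypothetical; shown only to illustrate the pairwise check against the 0.65 threshold):

```python
from math import sqrt

def pearson(a, b):
    """Sample Pearson correlation coefficient between two feature vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def collinear_pairs(features, threshold=0.65):
    """Return feature-name pairs whose |r| reaches the threshold,
    flagging candidates for removal due to multicollinearity."""
    names = list(features)
    return [(names[i], names[j])
            for i in range(len(names)) for j in range(i + 1, len(names))
            if abs(pearson(features[names[i]], features[names[j]])) >= threshold]
```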
To examine the influence of feature selection on model performance, the dataset was organized into seven feature groups based on modality combinations: driving performance only (Perf), eye movements only (Eye), physiological signals only (Phys), driving performance and eye movements (Perf_Eye), driving performance and physiological signals (Perf_Phys), eye movements and physiological signals (Eye_Phys), and all features combined (All).
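The seven groups are simply every non-empty combination of the three modalities, which can be enumerated programmatically (feature short names below are hypothetical labels for the 12 features described above):

```python
from itertools import combinations

MODALITIES = {
    "Perf": ["speed_mean", "speed_sd", "lat_mean", "lat_sd", "gas_mean", "brake_max"],
    "Eye":  ["blink_rate", "gaze_h_sd"],
    "Phys": ["hr", "hr_growth", "sdnn", "scl_sd"],
}

def feature_groups(modalities=MODALITIES):
    """Enumerate every non-empty modality combination, yielding the
    seven feature groups: Perf, Eye, Phys, Perf_Eye, Perf_Phys,
    Eye_Phys, and All."""
    names = list(modalities)
    groups = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            label = "All" if r == len(names) else "_".join(combo)
            groups[label] = [f for m in combo for f in modalities[m]]
    return groups
```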

3.2. Model Development and Evaluation

To ensure consistency in data distribution, the dataset was split into a training set (80%) and a testing set (20%) for model evaluation. A 10-fold cross-validation method was applied to the training set to optimize hyperparameters and prevent overfitting, with the process repeated 10 times (10 × 10 cross-validation) to enhance robustness. Model performance was evaluated using two key metrics: Accuracy (ACC), representing the proportion of correctly classified instances among the total instances, and the Area Under the Receiver Operating Characteristic Curve (AUC), which reflects the model’s ability to discriminate between classes. Hyperparameters for each algorithm were tuned via a grid-search approach during cross-validation to identify the optimal configuration for maximizing both ACC and AUC.
To evaluate the effects of feature selection and algorithm choice on model performance, a one-way analysis of variance (ANOVA) was conducted. Bonferroni correction was applied in the post hoc analysis to control for multiple comparisons. The significance level for all statistical tests was set at 0.05.
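For illustration, the F statistic underlying these tests can be computed directly from the per-group performance scores (a minimal sketch; the study presumably used standard statistical software):

```python
def one_way_anova_F(groups):
    """One-way ANOVA F statistic with its degrees of freedom:
    F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    F = (ss_between / df_between) / (ss_within / df_within)
    return F, df_between, df_within
```

The degrees of freedom reported below, e.g., F(6, 4193), correspond to the 7 feature sets compared over 4200 cross-validated scores.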
The model performance results are listed in Table 2. The ANOVA results indicated significant differences among the feature sets (ACC: F(6, 4193) = 855, p < 0.001; AUC: F(6, 4193) = 995, p < 0.001) and among the algorithms (ACC: F(5, 4194) = 45, p < 0.001; AUC: F(5, 4194) = 49, p < 0.001).
The following section is structured to first present differences in model performance across feature sets (Section 3.2.1), where results from different algorithms were averaged within each feature group (Figure 3) and then further broken down by individual algorithms (Figure 4). This is followed by an examination of performance variations across algorithms (Section 3.2.2), where results were averaged across feature sets (Figure 5) and subsequently detailed by feature groups (Figure 6). This two-step design ensures a comprehensive evaluation, allowing feature- and algorithm-level effects to be examined separately, and provides clearer insights for developing robust driver mental workload classification models.

3.2.1. Differences in Model Performance Across Feature Sets

The post hoc analysis of model performance metrics across different feature sets is presented in Figure 3. The results for both metrics were highly consistent. Except for the “Phys vs. Perf_Phys” and “Eye_Phys vs. All” pairs, significant differences were observed among the rest of the feature sets (AUC: Eye_Phys vs. All, p < 0.01; others, p < 0.001).
Notably, the classification model using only driving performance features yielded the lowest accuracy (57.94%) and AUC (0.6227) among all feature sets. This finding may help explain why prior studies on driver mental workload assessment (see Table 1) have often prioritized visual and physiological indicators over performance-based measures. Physiological metrics are generally regarded as more sensitive and directly reflective of cognitive load. In contrast, models incorporating combined feature sets consistently demonstrated superior performance, with significantly higher classification accuracy. Feature sets that included eye movement metrics outperformed those composed solely of driving performance or physiological signals. This finding highlights the critical role of visual indicators—such as blink rate and horizontal gaze variability—in detecting subtle shifts in cognitive demand that may not be captured by performance or physiological features alone. Similar trends have been reported in previous studies [14,19]. However, while adding more features generally improved model performance, these gains were not unlimited. A marginal increase in AUC was observed when transitioning from the “Eye_Phys” feature set to the more comprehensive “All” set, suggesting diminishing returns with the inclusion of additional parameters. This finding indicates that rather than maximizing feature quantity, a more effective strategy may involve carefully selecting and combining the most informative features to enhance model performance and practical applicability.
Figure 4 presents the post hoc testing results for different feature–algorithm combinations, offering deeper insight into whether certain algorithms are inherently better suited to exploit the synergies provided by multimodal datasets. For this analysis, only feature sets that included eye movement data—namely Eye, Perf_Eye, Eye_Phys, and All—were considered, as these consistently outperformed feature sets lacking eye movement metrics. This focused comparison allows for a more targeted evaluation of how algorithm choice interacts with high-performing, visually enriched feature combinations.
Figure 4 illustrates that model performance varies systematically across different combinations of feature sets and algorithms, providing insight into the interaction between model complexity and data richness. For simpler models such as Logistic Regression and Naive Bayes, the incorporation of more diverse and complex feature sets led to notable improvements in both accuracy and AUC. This suggests that these models benefit substantially from richer informational input, as also observed in previous studies [37,38]. In contrast, the performance of KNN showed a less consistent pattern. While models using strictly visual features underperformed in terms of accuracy, the inclusion of all feature types led to an improvement in AUC. This indicates that KNN may be more sensitive to the absence of key feature domains than to the presence of potentially redundant information, highlighting the algorithm’s dependence on local feature distributions. For higher-capacity algorithms such as Random Forest, SVM, and XGBoost, multimodal feature sets—particularly Eye_Phys and All—consistently achieved superior performance. These results underscore the strength of complex models in leveraging complementary input sources and effectively capturing nuanced relationships across modalities.
Overall, the analysis confirms that feature diversity significantly enhances model accuracy and robustness, particularly when visual and physiological metrics are included. However, the findings also underscore that increasing feature complexity does not necessarily lead to proportionally higher performance, indicating the importance of strategic feature selection over sheer quantity.

3.2.2. Differences in Model Performance Across Algorithms

Figure 5 highlights the influence of algorithm choice on classification performance. Naive Bayes and KNN consistently underperformed relative to the other models, suggesting that their simplifying assumptions and limited model complexity may hinder their ability to capture the intricate patterns present in mental workload signals. In contrast, Logistic Regression achieved performance levels comparable to more advanced models such as Random Forest, SVM, and XGBoost. This finding indicates that, in certain contexts, a well-specified linear baseline can offer competitive accuracy, particularly when supported by informative and well-selected features.
To further explore how algorithm choice interacts with different feature sets, a detailed analysis of model performance was conducted and is presented in Figure 6. Specifically, with respect to AUC, no significant differences were observed among the four algorithms—Logistic Regression, Random Forest, SVM, and XGBoost—for the Eye, Phys, Perf_Eye, Eye_Phys, and All feature sets. However, for the Perf (driving performance only) feature set, Logistic Regression performed significantly worse than Random Forest (p < 0.01) and SVM (p < 0.001). Interestingly, for the Eye (eye movement only) feature set, the pattern was reversed, with Logistic Regression significantly outperforming the other models. These findings underscore the nuanced interplay between feature selection and algorithm choice in mental workload modeling. They suggest that the effectiveness of a given algorithm is highly contingent upon the nature and informativeness of the input features, highlighting the importance of aligning model complexity with data characteristics.
With respect to accuracy (ACC), significant differences among algorithms were observed for all feature sets except the Phys (physiological metrics only) set. These differences were primarily associated with the performance of Logistic Regression and XGBoost. In the Eye, Perf_Eye, and All feature sets, XGBoost significantly outperformed the other algorithms, demonstrating its capacity to effectively leverage richer, multimodal data. Conversely, in the Perf and Eye_Phys feature sets, Logistic Regression showed the lowest performance among the four algorithms. These results further emphasize that algorithm effectiveness is not uniform across feature types; rather, it is shaped by the complexity and informativeness of the input data. Selecting an appropriate algorithm-feature combination is therefore critical for achieving optimal model performance in mental workload classification.
In summary, feature sets limited to either driving performance alone or physiological signals alone consistently produced the lowest accuracy and AUC across all algorithms. This reinforces the earlier conclusion that relying on a single data modality is generally insufficient for effective mental workload classification. Furthermore, as illustrated in Figure 6, differences in model performance across feature sets were more pronounced than those across algorithms. These findings suggest that the composition and quality of input features have a greater impact on classification performance than the specific choice of machine learning algorithm—an observation that aligns with prior research [14].

3.3. Model Explanation

Following the comparative analysis of six classification algorithms across seven feature sets, four algorithms—Logistic Regression, Random Forest, SVM, and XGBoost—were selected for final model development. This selection was based on their consistently strong performance across multiple feature configurations. In contrast, Naive Bayes and KNN demonstrated significantly lower accuracy and AUC values, particularly when applied to visual or behavioral data, and were therefore excluded from further analysis.
To identify the most effective feature combinations, a Sequential Backward Selection (SBS) strategy was applied to the feature sets used by the four retained algorithms. SBS iteratively removes the least informative feature at each step based on validation performance, with the goal of optimizing model generalization while reducing complexity. This process resulted in algorithm-specific optimal feature subsets, each comprising different combinations of the original 12 features.
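A backward selection step of this kind can be sketched with scikit-learn's `SequentialFeatureSelector`; the synthetic dataset, the estimator, and the target subset size of 7 are illustrative assumptions, not the study's exact setup.

```python
# Minimal sketch of sequential backward selection: starting from all 12
# candidate features, drop the least useful feature at each step based on
# cross-validated performance.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=12,
                           n_informative=5, random_state=0)

sbs = SequentialFeatureSelector(
    SVC(),                    # estimator whose validation score guides removal
    n_features_to_select=7,   # e.g. keep 7 of the 12 candidate features
    direction="backward",     # backward elimination rather than forward growth
    scoring="roc_auc",
    cv=5,
)
sbs.fit(X, y)
selected = sbs.get_support(indices=True)  # indices of the retained features
```

In practice this would be repeated per algorithm, yielding the algorithm-specific optimal subsets reported in Table 3.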
After systematically comparing various algorithms and feature combinations, the SVM was selected as the final model for classifying driver mental workload. This decision was based on a comprehensive evaluation of model accuracy, generalization capability, and training efficiency. As shown in Table 3, although advanced models such as Random Forest and XGBoost achieved competitive results, the SVM model attained the highest overall accuracy (ACC = 0.878) and a strong AUC (0.95) on the test set when trained on a combined feature set that included driving performance, eye movement, and physiological signals. Moreover, the SVM model demonstrated an effective balance between predictive performance and computational cost, making it a promising candidate for real-time workload monitoring applications. The final selected features included: mean lateral position, mean speed, mean accelerator position, blink rate, standard deviation of horizontal gaze position, heart rate, and heart rate growth rate.
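A hedged sketch of such a final SVM classifier is shown below. The feature names follow the selected subset listed above, but the data, split, and default hyperparameters are synthetic stand-ins rather than the fitted model itself.

```python
# Sketch of the final SVM pipeline: standardize inputs, fit an SVM with
# probability outputs, and report test-set accuracy and AUC.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = [  # the seven features retained for the final model
    "mean_lateral_position", "mean_speed", "mean_accelerator_position",
    "blink_rate", "std_horizontal_gaze", "heart_rate", "heart_rate_growth_rate",
]

X, y = make_classification(n_samples=300, n_features=len(FEATURES),
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Feature scaling matters for kernel SVMs, hence the pipeline.
model = make_pipeline(StandardScaler(), SVC(probability=True))
model.fit(X_tr, y_tr)

acc = accuracy_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```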
Despite its strong predictive performance, the SVM algorithm is often regarded as a “black-box” model due to its limited transparency in explaining how input features influence classification outcomes. To address this limitation and enhance model interpretability, SHAP was employed to quantify the marginal contribution of each feature to the model’s predictions. SHAP offers a unified, game-theoretic framework for interpreting machine learning models by calculating Shapley values, which represent the average marginal contribution of a feature across all possible subsets of input features. This allows for a comprehensive understanding of both global and instance-level feature importance. Given that the SVM is not a tree-based model, the KernelSHAP method was adopted in this study. While more computationally intensive than TreeSHAP, KernelSHAP is model-agnostic and thus well suited for explaining non-tree-based classifiers such as the SVM.
Figure 7 presents the SHAP summary plot for the selected SVM model, illustrating the contribution of each input feature to the classification of driver mental workload. Features are ranked vertically in descending order of overall importance, with the most influential variables positioned at the top. Each point on the plot represents a single observation, and its position along the x-axis reflects the corresponding SHAP value—that is, the degree to which the feature influenced the prediction for that instance. The color gradient represents the actual feature value: red indicates high values, blue indicates low values, and purple denotes values near the mean. This visualization provides insight into both the direction and magnitude of each feature’s impact on the model’s output.
As shown in the SHAP summary plot, blink rate emerged as the most influential feature in classifying driver mental workload, followed by heart rate, standard deviation of horizontal gaze position, and mean speed. Blink rate exhibited a strong positive relationship with workload, with higher values consistently associated with elevated mental demand. Heart rate also displayed a clear positive SHAP value distribution for higher measurements, reaffirming that physiological arousal is a critical indicator of cognitive load. In contrast, gaze variability demonstrated a negative relationship with workload: lower standard deviation in horizontal gaze was linked to higher workload, suggesting that reduced visual scanning is indicative of cognitive strain. Mean speed showed a mild positive association with workload, particularly in instances where elevated speed co-occurred with increased heart rate, highlighting the potential interaction between behavioral and physiological indicators.
Other features—such as heart rate growth rate, mean lateral position, and mean accelerator position—were also retained in the final model, but exhibited relatively lower standalone SHAP values. This suggests that while their individual contributions to the model’s predictions may be limited, they could play a complementary role through interactions with more dominant features. To investigate this possibility, a SHAP interaction analysis was conducted to examine how features jointly influenced mental workload classification.
Figure 8 presents SHAP dependence plots for three primary features—blink rate, heart rate, and horizontal gaze variability—illustrating their interaction effects with secondary features. In each subplot, the x-axis represents the value of the primary feature, the y-axis indicates the corresponding SHAP value (i.e., its contribution to the model output), and the color gradient encodes the value of the interacting secondary feature. These plots provide insight into non-linear dependencies and synergistic effects between features, further enhancing understanding of the model’s decision-making process.
The interaction between blink rate and heart rate growth rate revealed that the influence of blink frequency on mental workload was amplified when the driver’s heart rate was also increasing. Specifically, the model predicted higher workload levels when both blink rate and heart rate acceleration were elevated, suggesting that the concurrent presence of visual fatigue and physiological arousal intensifies the perceived cognitive demand.
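This kind of interaction reading can be illustrated numerically: split the SHAP values of a primary feature by the level of a secondary feature and compare the means. The arrays below are synthetic stand-ins constructed to mimic the reported amplification, not the study's actual SHAP output.

```python
# Toy illustration of a SHAP interaction reading: blink rate's contribution
# to predicted workload is amplified when heart rate growth is positive.
import numpy as np

rng = np.random.default_rng(0)
blink_rate = rng.normal(15, 4, 500)    # primary feature (blinks/min, synthetic)
hr_growth = rng.normal(0.0, 1.0, 500)  # secondary feature (synthetic)

# Toy SHAP values: blink rate pushes the prediction up more strongly
# when heart rate is also rising (the assumed interaction).
shap_blink = 0.05 * (blink_rate - 15) * (1 + 0.5 * (hr_growth > 0))

high = shap_blink[(blink_rate > 15) & (hr_growth > 0)].mean()
low = shap_blink[(blink_rate > 15) & (hr_growth <= 0)].mean()
# The amplification described in the text corresponds to `high` exceeding `low`.
```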
A similar interaction was observed between heart rate and mean speed. In this case, increases in speed contributed to higher workload predictions only when the driver’s heart rate exceeded approximately 80 beats per minute. This indicates that elevated speed alone does not necessarily imply increased workload unless it co-occurs with signs of physiological stress, underscoring the model’s sensitivity to contextual physiological cues.
Another notable interaction involved horizontal gaze variability and mean accelerator position. When gaze variability was low—indicating a narrow or concentrated field of visual attention—increased accelerator input was more likely to result in higher predicted workload levels. In contrast, when gaze was more dispersed, the effect of accelerator input on workload predictions was diminished. This suggests that a combination of intense visual focus and assertive pedal use may signal a cognitively demanding driving state, potentially linked to heightened task engagement or situational stress.

4. Model Application at Tunnel Entrance

Building upon the previously developed SVM-based mental workload model and its SHAP-based interpretation, this section explores the model’s practical applicability by predicting driver workload at tunnel entrances under varying roadway and lighting conditions. A case study was conducted using driving data collected from tunnel entrance segments, where participants encountered different combinations of longitudinal slopes and ambient light levels.
A total of 324 valid driving samples were obtained from 27 participants, each of whom completed 12 distinct driving scenarios. Each sample was classified using the finalized workload model. Table 4 summarizes the distribution of elevated and normal workload classifications across the different combinations of slope and lighting conditions.
In total, the model classified 214 out of 324 samples as elevated workload and 110 as normal workload, yielding a high-to-low ratio of nearly 2:1. This result indicates that tunnel entrances represent a driving environment where elevated cognitive demand is frequently experienced. Figure 9 illustrates the distribution of elevated workload proportions across varying longitudinal slopes and lighting conditions.
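The tabulation behind Table 4 and Figure 9 can be sketched as a simple group-by over per-sample predictions. The predictions below are random stand-ins for the model output; only the aggregation logic is the point.

```python
# Sketch of the tunnel-entrance analysis: classify each sample, then compute
# the proportion of elevated-workload predictions per slope/lighting cell.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "slope": rng.choice([2.5, 3.0, 3.5, 4.0], 324),
    "lighting": rng.choice(["bright", "reddish", "dark"], 324),
    # Stand-in for model.predict(features): 1 = elevated, 0 = normal.
    "predicted_elevated": rng.integers(0, 2, 324),
})

# Proportion of elevated-workload classifications in each condition
# (the quantity visualized as a heatmap in Figure 9).
heatmap = df.pivot_table(index="slope", columns="lighting",
                         values="predicted_elevated", aggfunc="mean")
overall = df["predicted_elevated"].mean()
```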
A clear trend emerged with respect to slope: steeper gradients were associated with a higher proportion of elevated workload classifications. For example, under the 4.0% slope condition with bright lighting, approximately 93% of samples were classified as elevated workload, compared with just 33% under the corresponding 2.5% slope condition. This pattern supports the assertion that increased longitudinal gradients place greater demands on drivers, thereby elevating mental workload [39].
In contrast, the effect of lighting conditions appeared less pronounced. Although minor variations in workload proportions were observed across different lighting environments (bright, reddish, and dark), the overall proportion of elevated workload samples remained relatively stable within each slope group. This suggests that longitudinal slope plays a more dominant role in influencing driver workload at tunnel entrances compared to changes in illumination.
Notably, even under bright lighting conditions—where the luminance contrast between the interior and exterior of the tunnel was minimal (85.62 vs. 83.71 cd/m²)—a substantial proportion of drivers (23.8%) were still classified as experiencing elevated workload. This finding indicates that the structural and spatial features of tunnel entrances, beyond luminance differences alone, may be significant contributors to cognitive strain during tunnel entry [23].

5. Conclusions

This study proposed and evaluated a machine learning framework for classifying drivers’ mental workload using multimodal data, including driving performance metrics, eye movement features, and physiological signals. Through a controlled driving simulator experiment, seven feature combinations and six classification algorithms were systematically compared. The results indicated that feature selection had a greater impact on model performance than algorithm choice. Notably, feature sets incorporating eye movement metrics consistently outperformed those relying solely on driving behavior or physiological measures, emphasizing the importance of visual attention in assessing cognitive demand.
While advanced algorithms such as SVM and XGBoost demonstrated strong classification performance, simpler models like Logistic Regression also achieved competitive results when supported by sufficient feature diversity. This finding suggests that including relevant and diverse features can reduce reliance on complex algorithms, offering a practical advantage for real-world workload monitoring systems.
To further refine model performance, a sequential backward selection approach was applied to identify optimal feature subsets for the four top-performing algorithms. The SVM model was ultimately selected as the final model due to its superior balance of accuracy, generalization capability, and computational efficiency. To enhance model interpretability, SHAP values were used to decompose the model’s predictions. The SHAP analysis revealed that blink rate and heart rate were the most influential predictors of mental workload, while interaction effects—such as those between heart rate and speed, or gaze variability and accelerator input—offered deeper insights into how physiological and behavioral indicators jointly influence workload estimation.
To evaluate the model’s real-world applicability, a case study was conducted using driving data from tunnel entrance segments with varying longitudinal slopes and lighting conditions. The model classified the majority of samples as elevated workload, particularly under steeper slope conditions, thereby validating its sensitivity to environmental complexity. These results demonstrate the model’s potential for integration into intelligent driving assistance systems and roadway infrastructure design. Moreover, the findings suggest that road geometry, rather than lighting conditions alone, plays a dominant role in shaping drivers’ cognitive states during complex roadway transitions.
Several limitations should be acknowledged. First, the sample size was relatively modest and predominantly male, which may limit the generalizability of the findings. Moreover, all data were collected in a simulated driving environment. While the simulator provided the advantage of controlling road conditions, ensuring participant safety, and allowing repeated exposure, its ecological validity is inevitably limited and may not fully capture the variability of real-world driving. To address this limitation, future work will focus on validating the proposed framework in naturalistic or on-road studies with larger and more diverse participant groups, thereby enhancing both the robustness and the practical applicability of the results. Second, the experimental vehicle was manually operated, limiting the generalizability of findings to partially or fully automated driving contexts. Future research should aim to validate the model in naturalistic settings, test cross-driver generalization, and extend the framework to higher levels of vehicle automation. Additionally, while eye-tracking and physiological sensors offer valuable insights, their large-scale deployment may be limited by cost and intrusiveness. Exploring less intrusive modalities (such as facial expressions or head pose) could both enhance accuracy and improve feasibility for real-world applications. Another limitation relates to the validation strategy. In this study, repeated 10-fold cross-validation was employed to obtain stable estimates with a limited dataset, which is also a common practice in driver workload studies. However, this approach may still allow partial subject-specific information to appear in both training and testing sets. A leave-one-subject-out (LOSO) strategy would provide a stricter evaluation of generalizability, and future work will consider this approach.
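The stricter leave-one-subject-out evaluation mentioned above maps directly onto scikit-learn's `LeaveOneGroupOut`; the sketch below uses synthetic data and subject IDs purely to show the fold structure.

```python
# Sketch of leave-one-subject-out (LOSO) cross-validation: each fold holds
# out every sample from one driver, so no subject appears in both splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=260, n_features=7, random_state=0)
subjects = np.repeat(np.arange(26), 10)  # e.g. 26 drivers, 10 samples each

logo = LeaveOneGroupOut()                # one fold per held-out subject
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, groups=subjects, cv=logo,
                         scoring="accuracy")
# scores.mean() estimates cross-driver generalization.
```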
In summary, this study presents a robust, interpretable, and generalizable framework for driver mental workload classification. By integrating multimodal signals with explainable machine learning techniques, it contributes both methodological and practical value to the development of cognitively aware driving systems.

Author Contributions

Conceptualization, X.S., X.M., F.C. and X.P.; methodology, X.S. and X.M.; software, X.S.; validation, X.S. and X.M.; formal analysis, X.S.; investigation, X.S.; resources, F.C. and X.P.; data curation, X.S.; writing—original draft preparation, X.S.; writing—review and editing, X.M.; visualization, X.S.; supervision, F.C. and X.P.; project administration, X.S.; funding acquisition, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 51978522 and No. 51808402).

Institutional Review Board Statement

Ethical review and approval were waived for this study because the research did not involve clinical diagnosis or treatment, drug trials, medical device trials, biological sample collection, invasive procedures, physical intervention, or other high-risk research activities.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. de Waard, D. The Measurement of Drivers’ Mental Workload; Groningen University: Groningen, The Netherlands, 1996. [Google Scholar]
  2. Bitkina, O.V.; Park, J.; Kim, H.K. The Ability of Eye-Tracking Metrics to Classify and Predict the Perceived Driving Workload. Int. J. Ind. Ergon. 2021, 86, 103193. [Google Scholar] [CrossRef]
  3. Jeong, H.; Liu, Y. Effects of Non-Driving-Related-Task Modality and Road Geometry on Eye Movements, Lane-Keeping Performance, and Workload While Driving. Transp. Res. Part F Traffic Psychol. Behav. 2019, 60, 157–171. [Google Scholar] [CrossRef]
  4. Brookhuis, K.A.; de Waard, D. Monitoring Drivers’ Mental Workload in Driving Simulators Using Physiological Measures. Accid. Anal. Prev. 2010, 42, 898–903. [Google Scholar] [CrossRef] [PubMed]
  5. Teh, E.; Jamson, S.; Carsten, O.; Jamson, H. Temporal Fluctuations in Driving Demand: The Effect of Traffic Complexity on Subjective Measures of Workload and Driving Performance. Transp. Res. Part F Traffic Psychol. Behav. 2014, 22, 207–217. [Google Scholar] [CrossRef]
  6. Jeon, M.; Walker, B.N.; Yim, J.B. Effects of Specific Emotions on Subjective Judgment, Driving Performance, and Perceived Workload. Transp. Res. Part F Traffic Psychol. Behav. 2014, 24, 197–209. [Google Scholar] [CrossRef]
  7. Young, M.S.; Brookhuis, K.A.; Wickens, C.D.; Hancock, P.A. State of Science: Mental Workload in Ergonomics. Ergonomics 2015, 58, 1–17. [Google Scholar] [CrossRef]
  8. Wen, H.; Sze, N.N.; Zeng, Q.; Hu, S. Effect of Music Listening on Physiological Condition, Mental Workload, and Driving Performance with Consideration of Driver Temperament. Int. J. Environ. Res. Public Health 2019, 16, 2766. [Google Scholar] [CrossRef]
  9. Öztürk, İ.; Merat, N.; Rowe, R.; Fotios, S. The Effect of Cognitive Load on Detection-Response Task (DRT) Performance During Day- and Night-Time Driving: A Driving Simulator Study with Young and Older Drivers. Transp. Res. Part F Traffic Psychol. Behav. 2023, 97, 155–169. [Google Scholar] [CrossRef]
  10. Freitas, A.; Almeida, R.; Gonçalves, H.; Conceição, G.; Freitas, A. Monitoring Fatigue and Drowsiness in Motor Vehicle Occupants Using Electrocardiogram and Heart Rate: A Systematic Review. Transp. Res. Part F Traffic Psychol. Behav. 2024, 103, 586–607. [Google Scholar] [CrossRef]
  11. Kim, H.; Hwang, Y.; Yoon, D.; Choi, W.; Park, C.H. Driver Workload Characteristics Analysis Using EEG Data from an Urban Road. IEEE Trans. Intell. Transp. Syst. 2014, 15, 1844–1849. [Google Scholar] [CrossRef]
  12. Marquart, G.; Cabrall, C.; de Winter, J. Review of Eye-Related Measures of Drivers’ Mental Workload. Procedia Manuf. 2015, 3, 2854–2861. [Google Scholar] [CrossRef]
  13. Tamantini, C.; Cristofanelli, M.L.; Fracasso, F.; Umbrico, A.; Cortellessa, G.; Orlandini, A.; Cordella, F. Physiological Sensor Technologies in Workload Estimation: A Review. IEEE Sens. J. 2025. [Google Scholar] [CrossRef]
  14. Solovey, E.T.; Zec, M.; Perez, E.A.G.; Reimer, B.; Mehler, B. Classifying Driver Workload Using Physiological and Driving Performance Data: Two Field Studies. In Proceedings of the Conference on Human Factors in Computing Systems-Proceedings, Toronto, ON, Canada, 26 April–1 May 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 4057–4066. [Google Scholar]
  15. Tran, C.; Yan, S.; Habiyaremye, J.L.; Wei, Y. Predicting Driver’s Work Performance in Driving Simulator Based on Physiological Indices. In Proceedings of the Intelligent Human Computer Interaction: 9th International Conference, Evry, France, 11–13 December 2017. [Google Scholar]
  16. Tjolleng, A.; Jung, K.; Hong, W.; Lee, W.; Lee, B.; You, H.; Son, J.; Park, S. Classification of a Driver’s Cognitive Workload Levels Using Artificial Neural Network on ECG Signals. Appl. Ergon. 2017, 59, 326–332. [Google Scholar] [CrossRef]
  17. Abd Rahman, N.I.; Md Dawal, S.Z.; Yusoff, N. Driving Mental Workload and Performance of Ageing Drivers. Transp. Res. Part F Traffic Psychol. Behav. 2020, 69, 265–285. [Google Scholar] [CrossRef]
  18. Meteier, Q.; Capallera, M.; Ruffieux, S.; Angelini, L.; Abou Khaled, O.; Mugellini, E.; Widmer, M.; Sonderegger, A. Classification of Drivers’ Workload Using Physiological Signals in Conditional Automation. Front. Psychol. 2021, 12, 596038. [Google Scholar] [CrossRef]
  19. He, D.; Wang, Z.; Khalil, E.B.; Donmez, B.; Qiao, G.; Kumar, S. Classification of Driver Cognitive Load: Exploring the Benefits of Fusing Eye-Tracking and Physiological Measures. Transp. Res. Rec. 2022, 2676, 670–681. [Google Scholar] [CrossRef]
  20. Wei, W.; Fu, X.; Zhong, S.; Ge, H. Driver’s Mental Workload Classification Using Physiological, Traffic Flow and Environmental Factors. Transp. Res. Part F Traffic Psychol. Behav. 2023, 94, 151–169. [Google Scholar] [CrossRef]
  21. Huang, J.; Peng, Y.; Hu, L. A Multilayer Stacking Method Base on RFE-SHAP Feature Selection Strategy for Recognition of Driver’s Mental Load and Emotional State. Expert Syst. Appl. 2024, 238, 121729. [Google Scholar] [CrossRef]
  22. Ma, J.; Wu, Y.; Rong, J.; Zhao, X. A Systematic Review on the Influence Factors, Measurement, and Effect of Driver Workload. Accid. Anal. Prev. 2023, 192, 107289. [Google Scholar] [CrossRef] [PubMed]
  23. Shao, X.; Chen, F.; Ma, X.; Pan, X. The Impact of Lighting and Longitudinal Slope on Driver Behaviour in Underwater Tunnels: A Simulator Study. Tunn. Undergr. Space Technol. 2022, 122, 104367. [Google Scholar] [CrossRef]
  24. Mehler, B.; Reimer, B.; Dusek, J.A. MIT AgeLab Delayed Digit Recall Task (n-Back); Massachusetts Institute of Technology: Cambridge, MA, USA, 2011. [Google Scholar]
  25. Kumagai, T.; Akamatsu, M. Prediction of Human Driving Behavior Using Dynamic Bayesian Networks. IEICE Trans. Inf. 2006, E89-D, 857–860. [Google Scholar] [CrossRef]
  26. Yang, L.; Ma, R.; Zhang, H.M.; Guan, W.; Jiang, S. Driving Behavior Recognition Using EEG Data from a Simulated Car-Following Experiment. Accid. Anal. Prev. 2018, 116, 30–40. [Google Scholar] [CrossRef]
  27. Xie, J.; Zhu, M. Maneuver-Based Driving Behavior Classification Based on Random Forest. IEEE Sensors Lett. 2019, 3, 1–4. [Google Scholar] [CrossRef]
  28. Wang, W.; Xi, J.; Chong, A.; Li, L. Driving Style Classification Using a Semisupervised Support Vector Machine. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 650–660. [Google Scholar] [CrossRef]
  29. Shi, X.; Wong, Y.D.; Li, M.Z.F.; Palanisamy, C.; Chai, C. A Feature Learning Approach Based on XGBoost for Driving Assessment and Risk Prediction. Accid. Anal. Prev. 2019, 129, 170–179. [Google Scholar] [CrossRef]
  30. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv 2018, arXiv:1802.03888. [Google Scholar]
  31. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems; 2017. arXiv:1705.07874. [Google Scholar]
  32. Engström, J.; Johansson, E.; Östlund, J. Effects of Visual and Cognitive Load in Real and Simulated Motorway Driving. Transp. Res. Part F Traffic Psychol. Behav. 2005, 8, 97–120. [Google Scholar] [CrossRef]
  33. Fuller, R. Towards a General Theory of Driver Behaviour. Accid. Anal. Prev. 2005, 37, 461–472. [Google Scholar] [CrossRef]
  34. Recarte, M.Á.; Pérez, E.; Conchillo, Á.; Nunes, L.M. Mental Workload and Visual Impairment: Differences Between Pupil, Blink, and Subjective Rating. Span. J. Psychol. 2008, 11, 374–385. [Google Scholar] [CrossRef]
  35. Wang, Y.; Reimer, B.; Dobres, J.; Mehler, B. The Sensitivity of Different Methodologies for Characterizing Drivers’ Gaze Concentration Under Increased Cognitive Demand. Transp. Res. Part F Traffic Psychol. Behav. 2014, 26, 227–237. [Google Scholar] [CrossRef]
  36. Meteier, Q.; De Salis, E.; Capallera, M.; Widmer, M.; Angelini, L.; Abou Khaled, O.; Sonderegger, A.; Mugellini, E. Relevant Physiological Indicators for Assessing Workload in Conditionally Automated Driving, Through Three-Class Classification and Regression. Front. Comput. Sci. 2022, 3, 775282. [Google Scholar] [CrossRef]
  37. Cardone, D.; Perpetuini, D.; Filippini, C.; Mancini, L.; Nocco, S.; Tritto, M.; Rinella, S.; Giacobbe, A.; Fallica, G.; Ricci, F.; et al. Classification of Drivers’ Mental Workload Levels: Comparison of Machine Learning Methods Based on ECG and Infrared Thermal Signals. Sensors 2022, 22, 7300. [Google Scholar] [CrossRef]
  38. Islam, M.R.; Barua, S.; Ahmed, M.U.; Begum, S.; Aricò, P.; Borghini, G.; Flumeri, G.D. A Novel Mutual Information Based Feature Set for Drivers’ Mental Workload Evaluation Using Machine Learning. Brain Sci. 2020, 10, 551. [Google Scholar] [CrossRef]
  39. Feng, Z.; Yang, M.; Du, Y.; Xu, J.; Huang, C.; Jiang, X. Effects of the Spatial Structure Conditions of Urban Underpass Tunnels’ Longitudinal Section on Drivers’ Physiological and Behavioral Comfort. Int. J. Environ. Res. Public Health 2021, 18, 10992. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Driving simulator (left), eye tracking glasses with calibration board (top right), and physiology measuring device with sensors (bottom right).
Figure 2. Experiment design of road sections for each drive.
Figure 3. Model performance across feature sets (significant difference observed across all pairs except for the pairs noted by NS).
Figure 4. Model performance across algorithms for feature sets including eye movement metrics (significant difference observed across all pairs except for the pairs noted by NS).
Figure 5. Model performance across algorithms (significant difference observed across all pairs except for the pairs noted by NS).
Figure 6. Model performance across feature sets for different algorithms. Note: *, **, ***, **** indicate the variable was statistically significant at 0.05, 0.01, 0.001 and 0.0001 level, respectively.
Figure 7. Features that have the greatest impact on the driver’s mental workload.
Figure 8. SHAP interaction effects for key feature pairs, including blink rate and heart rate growth rate, heart rate and mean speed, horizontal gaze variability and average position of gas pedal.
Figure 9. Heatmap of elevated mental workload proportion across slopes and lighting conditions.
Table 1. Machine learning based driving mental workload studies.
| Authors | Subjective Features | Task-Performance Features | Physiological Features | Algorithms | Classification Labels | Classes | Best Performance (Accuracy) |
|---|---|---|---|---|---|---|---|
| Solovey et al., 2014 [14] | NA | Vehicle velocity, steering wheel reversals | Heart rate, skin conductance level | Decision Tree; Logistic Regression; Multilayer Perceptron; Naïve Bayes; Nearest Neighbors | N-back | 2 | 75.7% |
| Tran et al., 2017 [15] | NASA-TLX | Number of errors | Heart rate, heart rate variability, blink rate, pupil dilation, blink duration, fixation duration | Group method of data handling | Situation complexity | 3 | 0.781 (R²) |
| Tjolleng et al., 2017 [16] | NA | NA | Heart rate variability | Artificial neural network | N-back | 3 | 82% |
| Abd Rahman et al., 2020 [17] | NASA-TLX | Number of traffic violations, speed variability, reaction time | EEG | Multiple linear regression | Situation complexity | 3 | NA |
| Meteier et al., 2021 [18] | NA | NA | Heart rate variability, electrodermal activity, respiration | Random Forest; C-support Vector; Multi-Layer Perceptron | Verbal cognitive task | 2 | 95% |
| He et al., 2022 [19] | NA | NA | Heart rate and heart rate variability, eye-tracking features, galvanic skin response | Artificial neural network; K-Nearest Neighbors; Support Vector Machine; Feedforward Neural Network; Recurrent Neural Network; Random Forest | N-back | 3 | 97.8% |
| Wei et al., 2023 [20] | NA | Traffic volume, space headway | Heart rate growth rate, heart rate variability, electrodermal activity | Neural networks; support vector machines; random forest | NASA-TLX scores | 3 | 97.8% |
| Huang et al., 2024 [21] | NA | Steering wheel turning angle, steering wheel speed, following time distance, lateral position | EEG, electrodermal activity | XGBoost; LightGBM; CatBoost; K-Nearest Neighbors; multilayer stacking ensemble learning | NASA-TLX scores | 3 | 97.48% |

Note: NA refers to not applicable.
Table 2. Model performance of driver’s mental workload classification across feature sets and algorithms.
| Metric | Algorithm | Perf | Eye | Phys | Perf_Eye | Perf_Phys | Eye_Phys | All |
|---|---|---|---|---|---|---|---|---|
| ACC (%) | Logistic Regression | 57.94 | 76.82 | 72.39 | 79.85 | 74.39 | 82.82 | 85.67 |
| ACC (%) | Naïve Bayes | 54.67 | 76.34 | 71.43 | 75.24 | 68.36 | 82.50 | 81.20 |
| ACC (%) | KNN | 54.20 | 75.66 | 61.29 | 79.72 | 57.96 | 81.08 | 80.98 |
| ACC (%) | Random Forest | 60.47 | 75.84 | 71.69 | 80.62 | 74.33 | 87.18 | 86.71 |
| ACC (%) | SVM | 62.39 | 77.81 | 72.71 | 79.55 | 72.09 | 87.48 | 86.22 |
| ACC (%) | XGBoost | 62.60 | 78.53 | 71.18 | 81.94 | 73.63 | 87.52 | 87.79 |
| AUC | Logistic Regression | 0.62 | 0.87 | 0.81 | 0.89 | 0.82 | 0.92 | 0.94 |
| AUC | Naïve Bayes | 0.63 | 0.86 | 0.81 | 0.84 | 0.78 | 0.91 | 0.89 |
| AUC | KNN | 0.54 | 0.86 | 0.66 | 0.87 | 0.61 | 0.89 | 0.91 |
| AUC | Random Forest | 0.67 | 0.85 | 0.81 | 0.89 | 0.84 | 0.93 | 0.94 |
| AUC | SVM | 0.68 | 0.83 | 0.81 | 0.88 | 0.81 | 0.94 | 0.94 |
| AUC | XGBoost | 0.64 | 0.86 | 0.80 | 0.89 | 0.84 | 0.93 | 0.95 |
Table 3. Model performance of driver’s mental workload classification.
| Algorithm | Optimal Feature Group | Training ACC | Training AUC | Testing ACC | Testing AUC | Training Time (s) |
|---|---|---|---|---|---|---|
| Logistic Regression | Mean and std. of lateral position, blink rate, heart rate growth rate | 86.5% | 0.936 | 82.9% | 0.93 | 1.596 |
| Random Forest | Mean lateral position, mean speed, blink rate, heart rate, heart rate growth rate | 87.6% | 0.950 | 85.4% | 0.95 | 12.080 |
| SVM | Mean lateral position, mean speed, mean accelerator position, blink rate, std. of horizontal gaze position, heart rate, heart rate growth rate | 88.2% | 0.953 | 87.8% | 0.95 | 2.177 |
| XGBoost | Mean lateral position, mean speed, blink rate, heart rate, heart rate growth rate | 87.5% | 0.958 | 85.4% | 0.96 | 4.732 |
Table 4. Number of elevated (outside the bracket) and normal (inside the bracket) mental workload samples at tunnel entrances under different slopes and lighting conditions.
| Longitudinal Slope | Bright | Reddish | Dark | Total |
|---|---|---|---|---|
| 2.5% | 9 (18) | 12 (15) | 8 (19) | 29 (52) |
| 3.0% | 18 (9) | 18 (9) | 11 (16) | 47 (34) |
| 3.5% | 25 (2) | 21 (6) | 20 (7) | 66 (15) |
| 4.0% | 25 (2) | 26 (1) | 21 (6) | 72 (9) |
| Total | 77 (31) | 77 (31) | 60 (48) | 214 (110) |