Article

CogMamba: Multi-Task Driver Cognitive Load and Physiological Non-Contact Estimation with Multimodal Facial Features

School of Electrical Engineering, Sichuan University, Chengdu 610065, China
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5620; https://doi.org/10.3390/s25185620
Submission received: 15 July 2025 / Revised: 14 August 2025 / Accepted: 2 September 2025 / Published: 9 September 2025
(This article belongs to the Section Physical Sensors)

Abstract

The cognitive load of drivers directly affects the safety and practicality of advanced driver assistance systems, especially in autonomous driving scenarios where drivers need to quickly take control of the vehicle after performing non-driving-related tasks (NDRTs). However, existing driver cognitive load detection methods have shortcomings: invasive detection equipment cannot be deployed inside vehicles, and many approaches are limited to eye movement detection, which restricts their practical application. To achieve more efficient and practical cognitive load detection, this study proposes a multi-task non-contact cognitive load and physiological state estimation model based on RGB video, named CogMamba. The model utilizes multimodal features extracted from facial video and introduces the Mamba architecture to efficiently capture local and global temporal dependencies, thereby jointly estimating cognitive load, heart rate (HR), and respiratory rate (RR). Experimental results demonstrate that CogMamba exhibits superior performance on two public datasets and shows excellent robustness under cross-dataset generalization tests. This study provides insights for non-contact driver state monitoring in real-world driving scenarios.

1. Introduction

According to a WHO report, road traffic accidents cost most countries 3% to 5% of GDP [1]. While it is clear that human factors account for a substantial portion of these accidents, the gradual advancement of intelligent driving systems and related assistive technologies holds promise for future improvement [2]. As autonomous driving technology has developed, the Society of Automotive Engineers (SAE) has classified driving automation into six levels, from Level 0 to Level 5 [3], but this has not led to a significant decrease in the rate of traffic accidents compared to manual driving by humans [4]. In Level 2 (L2) automation, vehicles can perform basic driving tasks, but continuous driver supervision remains necessary, and the driver must be ready to take control at any moment. In contrast, Level 3 (L3) automation allows the vehicle to handle most driving tasks, although the driver must still intervene in response to take-over requests (TORs). At the same time, as manual vehicle operation gradually decreases, the likelihood of the driver engaging in NDRTs increases. There is a marked difference in cognitive demand when a driver is engaged in an NDRT compared to when they are focused solely on driving [5]. This disparity significantly affects the driver's ability to resume control of the vehicle swiftly and effectively in unexpected situations [6,7]. Numerous accidents have occurred due to drivers not responding adequately to such transitions [8,9]. Therefore, establishing a Driver Monitoring System (DMS) to monitor driver cognitive load is vital for the continued development of autonomous driving technologies and for improving road traffic safety. At the same time, accurately detecting human cognitive load can effectively help drivers belonging to special groups, such as the hearing-impaired, to better complete autonomous driving tasks [10].
Driver cognitive load detection has attracted growing attention in both academic research and industry applications [11]. Traditional methods typically rely on physiological sensors to monitor the state of the subject [12,13]. Though physiological sensors are effective in monitoring the driver's state, they also present several limitations. Invasive sensors such as electroencephalography (EEG) [14] are not suitable for vehicle deployment, and non-invasive sensors often require drivers to wear expensive and complex equipment [15], making them impractical for real-world driving scenarios [16]. As a result, contactless methods for the detection of cognitive load have been explored. Some studies have employed non-contact devices (e.g., cameras) to estimate cognitive load [17,18,19]. However, most of them focus solely on eye movement or rely on non-end-to-end machine learning (ML), which overlooks other physiological responses and consequently limits generalization.
In fact, HR and RR are critical physiological indicators of cognitive load [20], as fluctuations in cognitive demand often manifest as measurable changes in these signals [21]. To achieve a more comprehensive detection of cognitive load, it is necessary to model both HR and RR in addition to cognitive load [21]. However, training independent models for each task would significantly increase deployment costs and reduce iterative efficiency [22]. Given the intrinsic correlations among cognitive load, HR, and RR, a multi-task model presents a compelling solution [23]. Nevertheless, constructing a vision-based multi-task model for cognitive load detection introduces several challenges. First, integrating multiple tasks with distinct data distributions and learning objectives into a single model increases the complexity of training and optimization compared to single-task models [24]. Second, the relationships among cognitive load, HR, and RR are not consistently stable, making it difficult to infer one metric directly from another through simple joint modeling [25]. Third, vision-based systems often struggle to capture temporal dependencies effectively [26], which also impairs the training efficiency and performance of multi-task models.
For the above reasons, this paper proposes an RGB multi-task video-based driver cognitive load and physiological estimation model (i.e., CogMamba). Building on previous research [27,28,29], the feature extraction module is designed to avoid the computational cost and excessive redundant information by focusing on key facial features of the driver. Moreover, given the importance of physiological indicators such as HR and RR in evaluating driver states, we incorporate remote photoplethysmography (rPPG) to extract blood volume pulse signals from critical facial regions [30,31,32]. We organize these optical signals to construct the multimodal information input, i.e., a spatio-temporal map (STMap) [33,34]. For the feature interaction module, we introduce the Mamba architecture [35]. Aligned feature vectors are passed through stacked Mamba blocks, where the state transition matrix enables continuous temporal updates and efficient dynamic feature enhancement [36]. Leveraging the capabilities of state space models (SSMs), our approach captures long-range dependencies among features, enhancing the model's ability to infer driver states [35]. Notably, compared to conventional deep learning models such as Transformers, whose computational complexity is $O(N^2)$, the Mamba structure maintains a linear complexity of $O(N)$, making it more suitable for deployment [37]. This is beneficial for building lightweight models. Following the Mamba module, a lightweight two-layer multilayer perceptron (MLP) is employed for downstream tasks, avoiding more complex structures such as residual convolutions or multi-head attention mechanisms.
In summary, the main contributions of this work are as follows:
  • To the best of our knowledge, this is the first end-to-end multi-task non-contact driver cognitive load and physiological estimation model based on multimodal facial features from a camera.
  • The proposed CogMamba utilizes STMap and key facial features—including landmarks, eye regions, and mouth area—instead of full-frame video input, thereby significantly reducing model parameters and computational load.
  • We incorporate the Mamba architecture to enhance the extraction of both local and global temporal features, achieving higher efficiency and lower resource consumption compared to traditional attention mechanisms. This efficiency gain stems from the recursive nature of SSMs, which eliminates the need for attention matrix computation. Additionally, lightweight MLPs are employed to further simplify the model architecture and reduce overall complexity.
  • The proposed system demonstrates strong performance in assessing driver cognitive load. Furthermore, the experimental results show that the system performs robustly under varying lighting conditions and across different skin tones.

2. Related Works

2.1. Contact-Based Cognitive Load Detection

The cognitive load of drivers plays a critical role in their ability to perform driving tasks within autonomous driving environments, underscoring the need for accurate and efficient detection methods [38]. According to the existing literature [39], contact-based cognitive load detection has been widely explored. These traditional approaches typically rely on physiological sensors that must maintain direct contact with the driver’s body to capture signals such as EEG, electrocardiography (ECG), and electrodermal activity (EDA) [40].
For instance, Gerjets et al. developed a model based on EEG signals to evaluate cognitive load [41]. However, due to the inherent challenges in acquiring EEG data, alternative methods using ECG and EDA signals have also been investigated [40]. Despite their effectiveness, these contact-based approaches rely on invasive or semi-invasive sensor configurations, requiring physical contact with the driver’s body [42]. This limitation significantly restricts their applicability in real-world driving scenarios [43].

2.2. Non-Contact-Based Cognitive Load Detection

Recent years have seen increasing interest in non-contact cognitive load detection, which eliminates the need for intrusive physiological sensors and is thus suitable for real-world applications such as driving, e-learning, and human–computer interaction. Traditional approaches mainly rely on hand-crafted features derived from visual (e.g., eye tracking, pupil dilation, facial expression) or remote physiological signals (e.g., HR variability from magnetic cardiography, non-contact EEG) [43]. Various machine learning models have been applied for cognitive load classification, including Support Vector Machines (SVMs), Random Forests (RFs), and gradient boosting [44]. For instance, Rahman et al. [17] employed eye-tracking features and SVM to achieve high accuracy in binary cognitive load classification tasks. Similarly, HRV-based models using classical ML methods such as KNN or Decision Trees have reported high performance when classifying low and high cognitive load [45]. However, these methods often rely heavily on manually selected features and are highly sensitive to environmental noise and subject variability. Moreover, most traditional ML models lack generalization to unseen conditions or participants, which limits their applicability in dynamic and real-world settings.
Deep learning has recently demonstrated superior performance over traditional machine learning across a wide range of perception tasks. Models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), the Vision Transformer (ViT), and their variants (e.g., CNN-LSTM, CNN-Transformer, attention-based networks) [46] have been used to automatically extract and learn representations from raw sensor data, including EEG, eye movement, and video. Notably, some deep learning models have achieved accuracy comparable or superior to handcrafted-feature models. For instance, CNN-based architectures have been applied to gaze heatmaps and achieved high accuracy in classifying task-induced mental workload [17]. Meanwhile, facial expression recognition is of great significance for cognitive load detection. Currently, there are many CNN architectures conducting research in this area, such as AlexNet, VGG16, and ResNet50 [47,48]. Recent efforts also include multimodal deep learning combining video, audio, and physiological data, achieving robust cognitive load estimation through shared representation learning [49]. At the same time, some researchers have attempted to use reinforcement learning methods, such as Bayesian reinforcement learning, to conduct related research on cognitive load [50]. Nevertheless, most of these works treat cognitive load estimation as a single-task problem, often targeting only binary or multi-class classification of load levels. This fails to capture the multifaceted nature of human cognitive loads and limits model utility in complex scenarios where multiple related cognitive and behavioral outputs need to be estimated simultaneously [51].
To address these limitations, we propose a novel approach that leverages multi-task deep learning for non-contact cognitive load detection. Our method is built upon the recently proposed Mamba architecture—a sequence modeling framework that achieves state-of-the-art performance in various vision and language tasks [37]. Mamba introduces SSMs with linear recurrent mechanisms, offering a compelling trade-off between long-range dependency modeling and computational efficiency. We adapt Mamba for the multi-task setting by designing a unified model, and the multi-task formulation improves generalization. To the best of our knowledge, this is the first work to introduce Mamba in the context of cognitive load estimation and the first to formulate a multi-task deep learning solution for non-contact cognitive load assessment.

2.3. Mamba

Mamba was initially introduced for efficient long-sequence modeling in natural language processing [37]. With its linear recurrent architecture and selective state updates, Mamba quickly gained traction, leading to multiple variants across domains [52,53]. In vision, Bidirectional State Space Models (BSSMs) were incorporated to form Vision Mamba (Vim) [54], which processes image sequences with enhanced context awareness and positional encoding. Vim achieves faster inference and reduced memory usage compared to Transformers on high-resolution images, making it a strong candidate for visual multi-task learning tasks such as non-contact cognitive load estimation.

3. Methodology

3.1. Preliminaries

3.1.1. State Space Modeling and Discretization Principles

When dealing with multimodal sequence data, it is crucial to construct an effective model that can describe its dynamic evolution. SSM, a mathematical framework widely used in control systems, has also demonstrated powerful time-series modeling capabilities in deep learning. The basic idea is to use an internal state vector to characterize the dynamic process of input signals over time and generate output responses.
For typical linear time-invariant (LTI) systems in the continuous time domain, a state space model can be represented by the following differential equation:
$\dot{h}(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$
Among them, the hidden state of the system $h(t) \in \mathbb{R}^{N}$ serves as the memory that records historical inputs; the current input is $x(t) \in \mathbb{R}$; and the corresponding output is $y(t) \in \mathbb{R}$. $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, which describes how the state itself evolves over time; $B$ and $C$ are the projection matrices of the input and state, respectively; and $D$ is the residual connection term.
In order to adapt to discrete input sequences (such as image frames) in deep learning, the above system must be converted to a discrete-time form. A commonly used method is zero-order hold (ZOH), which assumes that the input remains constant within each sampling period [55]. The parameters of the continuous system can be mapped to the discrete system using matrix exponentials:
$h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k + D x_k$
Among them, the discretized system matrices are given by $\bar{A} = e^{A \Delta t}$ and $\bar{B} = \left( \int_{0}^{\Delta t} e^{A\tau} \, d\tau \right) B$, where $\Delta t$ denotes the sampling time interval. This discretization strategy enables the state space model to be directly integrated into neural networks for processing discrete sequential inputs of arbitrary length, while still preserving the benefits of continuous-time dynamics. In the Mamba architecture, the state transition matrix $A$ is not learned as a dense matrix. Instead, it is parameterized as a diagonal matrix with negative real entries via $A = -e^{A_{\log}}$, where $A_{\log}$ is a learnable low-dimensional parameter vector. This formulation guarantees the stability of the system by ensuring all eigenvalues of $A$ are negative. To discretize $A$, Mamba applies an efficient element-wise exponential, $\bar{A} = e^{A \Delta t}$, which is computationally inexpensive due to the diagonal form of $A$. The input-dependent nature of $\Delta t$ further empowers the model to adaptively emphasize or attenuate different parts of the input sequence, thereby enabling selective temporal modeling. This stable, efficient, and input-aware state space formulation is particularly well-suited for processing long-range dependencies in visual sequential data, such as facial dynamics in videos. Our symbols and descriptions are summarized in Table 1.
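To make this concrete, the following minimal NumPy sketch applies the ZOH rule to a diagonal SSM parameterized as described above. The variable names (A_log, delta_t) and all dimensions are illustrative placeholders, not values from the released implementation.

```python
import numpy as np

# Minimal sketch of the diagonal SSM discretization described above (illustrative values).
N = 16                        # state dimension
A_log = np.random.randn(N)    # learnable log-parameterization
A = -np.exp(A_log)            # diagonal entries, guaranteed negative (stable system)
B = np.random.randn(N)        # input projection (a per-state vector here)
delta_t = 0.05                # step size; input-dependent in Mamba, fixed here for clarity

# Zero-order hold (ZOH) discretization, element-wise thanks to the diagonal A:
#   A_bar = exp(A * dt),  B_bar = (exp(A * dt) - 1) / A * B
A_bar = np.exp(A * delta_t)
B_bar = (A_bar - 1.0) / A * B

def ssm_step(h, x):
    """One discrete update: h_k = A_bar * h_{k-1} + B_bar * x_k."""
    return A_bar * h + B_bar * x

h = np.zeros(N)
for x in np.sin(np.linspace(0, 10, 300)):   # toy scalar input sequence
    h = ssm_step(h, x)
```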

3.1.2. Advantages of the Convolution Equivalence Form and Mamba Model

Although the above discrete iterative form is very similar to traditional RNNs, further expansion reveals that it is mathematically equivalent to one-dimensional convolution operations. Specifically, if the state evolution of the previous L time steps is expanded, a convolution kernel can be constructed by weighting and summing the historical inputs:
$\mathcal{K} = \left( C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^{2}\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B} \right)$
Thus, the output sequence can be obtained by calculating the convolution with the input sequence:
$y = x * \mathcal{K}$
The advantage of this convolutional form lies in its ability to utilize highly parallelized hardware (such as GPUs) to perform simultaneous computations across the entire sequence, thereby significantly enhancing the efficiency of modeling long sequences.
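As a quick numerical check of this equivalence, the toy sketch below runs the same discrete SSM both recursively and through the expanded kernel and verifies that the two outputs match. All matrices are random placeholders, and the residual term $D x_k$ is omitted for brevity.

```python
import numpy as np

# Toy discrete LTI SSM, used to verify the recursion/convolution equivalence numerically.
rng = np.random.default_rng(0)
N, L = 4, 50
A_bar = np.diag(np.exp(-rng.uniform(0.1, 1.0, N)))  # stable diagonal state matrix
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)

# Recursive form: h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C h_k
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec[k] = (C @ h).item()

# Convolutional form: K = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar), y = x * K
K = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item() for i in range(L)])
y_conv = np.array([np.dot(K[:k + 1][::-1], x[:k + 1]) for k in range(L)])

assert np.allclose(y_rec, y_conv)   # both forms produce identical outputs
```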
The Mamba model is developed based on this convolutional representation. It combines a structured state space parameter design with efficient numerical discretization methods, enabling the SSM technique—originally designed for low-speed feedback control—to be extended for application in large-scale visual and signal processing tasks. Compared to attention-based models like Transformers, Mamba does not rely on global attention mechanisms with quadratic complexity. Instead, it maintains good scalability while modeling long-range dependencies through state space recursion.
Additionally, Mamba’s core includes linear projections, gating mechanisms, and nonlinear transformations, enabling it to simulate the linear system behavior of traditional SSMs while also expressing complex nonlinear dynamic processes. This integrated design makes Mamba an ideal foundational architecture for processing multimodal continuous signals, providing a unified dynamic modeling framework for subsequent cross-modal alignment and fusion.

3.2. Overview

This study develops a non-contact cognitive load detection model for drivers based on Mamba (i.e., CogMamba). The overall method combines the long-time-series modeling advantages of the structured state space model Mamba and constructs a multi-stage information processing flow based on task characteristics. As shown in Figure 1, the overall framework includes the following main stages: feature extraction and alignment, bidirectional feature interaction, bidirectional feature fusion, and optimization target design.
First, to fully capture the dynamic information related to facial regions and physiological signals, the input data undergoes preprocessing and sliding window segmentation, followed by the extraction of local temporal features from multiple key regions. We designed customized feature extraction modules for five parts: left eye, right eye, mouth, facial landmarks, and STMap. We chose them as features for the following reasons. First, the eye and mouth regions are widely recognized as key facial areas that exhibit significant changes under varying cognitive load, such as alterations in blink rate, gaze stability, and mouth movements [56,57]. Second, facial landmarks provide a holistic representation of the entire facial structure and its dynamic states, enabling the capture of subtle changes in head pose, micro-expressions, and overall muscle tension that are also indicative of cognitive load [58,59]. Finally, STMaps are derived from remote photoplethysmography (rPPG) signals and provide temporal patterns of physiological responses, such as heart rate variations, which are closely associated with mental workload [60,61]. By combining these three modalities, our approach leverages both visual and physiological indicators to achieve a more comprehensive and robust estimation of the driver’s cognitive load. Features from different parts vary in expression, temporal characteristics, and spatial density, so they must be aligned using a unified strategy to standardize frame rates and scales after extraction.
After feature alignment, the outputs from the five submodules are serialized into a unified format as input for subsequent temporal modeling. To achieve efficient and structurally aware multi-source information fusion, we introduce the Mamba structure during the feature interaction stage. Specifically, the five serialized feature streams maintain their individual characteristics while undergoing bidirectional modeling through a shared state space dynamic matrix, capturing long-term dependencies and enabling modal-level collaborative dialogue.
Subsequently, the five interaction features processed by the Mamba encoder are further integrated in the bidirectional fusion module. The unified representation after fusion is fed into MLP prediction heads, which output three key physiological and psychological indicators: HR, RR, and cognitive load. Through an end-to-end training mechanism, the model can maintain high temporal resolution while jointly modeling multiple output targets, thereby improving the overall accuracy and robustness of the estimates.

3.3. Feature Extraction and Alignment

Given that the structure and consistency of the input have a decisive impact on the effectiveness of the final representation learning, feature extraction and alignment are essential. To fully explore the temporal correlation between local facial regions and physiological indicators, we designed a complete data processing workflow before formally entering the modeling stage, covering the entire chain of operations from raw input organization to high-dimensional feature representation, including data processing, feature extraction, and alignment.

Data Processing

Specifically, to ensure temporal consistency, all data segments are sampled using a sliding window method, with a fixed length of L = 300 and a step size of s = 30 for each sampling, ensuring the continuity and redundancy of local signals on the time axis. This strategy not only improves the stability of the model training process but also lays the foundation for the subsequent establishment of models for long-term dependencies.
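A minimal sketch of this windowing step is given below, assuming the decoded frames are already stacked into a NumPy array; the array shape is a placeholder rather than the exact preprocessing pipeline.

```python
import numpy as np

def sliding_windows(frames: np.ndarray, length: int = 300, step: int = 30):
    """Split a frame sequence of shape (T, ...) into overlapping segments of
    `length` frames taken every `step` frames, as described above."""
    return [frames[i:i + length] for i in range(0, len(frames) - length + 1, step)]

# Example: 30 s of 30 Hz video (900 frames) yields 21 overlapping 300-frame windows.
video = np.zeros((900, 3, 128, 128))     # toy stand-in for decoded RGB frames
segments = sliding_windows(video)        # each element has shape (300, 3, 128, 128)
```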
After sampling, we cropped out local image sequences of the left eye, right eye, and mouth regions. At the same time, we extracted the corresponding facial landmarks' coordinate information, and an STMap was generated based on rPPG technology for each frame [62,63]. The image sequences of the left eye, right eye, and mouth are unified into a four-dimensional tensor of shape $\mathbb{R}^{L \times 3 \times H \times W}$, where $L = 300$ denotes the sequence length of 300 frames per segment, and $C = 3$ represents the number of RGB channels. The image sizes vary by region, being $25 \times 25$ for the eye area and $15 \times 35$ for the mouth area. In response to this, we designed three customized feature extraction modules (see Figure 2) for five types of input to generate aligned temporal feature vectors. Eye and mouth image sequences are encoded by structurally consistent convolutional networks, SubregionEmbedding. This module employs two layers of 2D convolution and pooling operations to extract spatial local features. Prior to convolution, a WTSM module is introduced to apply temporal perturbation to the input, thereby enhancing temporal robustness. After the encoding process, the extracted results are mapped to a fixed-length vector of dimension $d$ via a fully connected layer. The output for each region is a sequence feature of shape $\mathbb{R}^{L \times d}$.
The facial landmarks are embedded using the LandmarkEmbedding network. This network first reshapes the landmark data into a tensor of shape $\mathbb{R}^{L \times 106 \times 2}$ and extracts spatial structural features through a convolutional layer with normalization and activation functions. Here, 106 represents the 106 points that constitute the facial landmarks, and 2 represents the two-dimensional plane on which they are located. The extracted features are flattened and input into a fully connected layer, mapped to a $d$-dimensional vector, and then restored to a sequence form $\mathbb{R}^{L \times d}$.
The encoding of the STMap is performed by the STMapEmbedding module. This module receives a tensor of shape $\mathbb{R}^{3 \times H \times L}$ as input, which is processed through two layers of convolution, batch normalization, and activation functions. Subsequently, adaptive average pooling is applied to compress the spatial dimension to a uniform width (300 time steps). Finally, the sequence features are permuted and reshaped into $\mathbb{R}^{L \times d}$, where the feature dimension is determined by the number of output channels of the convolution layer.
The above five features are strictly aligned in the temporal dimension, as they are all extracted using the same index sliding window method, have consistent lengths, and require no additional interpolation. To achieve a unified representation of information, the five feature sequences are concatenated in the channel dimension to form the final feature sequence $F \in \mathbb{R}^{L \times D}$, where $D = 5d$. The concatenated features retain the unique structural characteristics of each region while forming a high-dimensional representation with cross-regional semantic complementarity, laying the foundation for a unified representation for subsequent temporal modeling and modal interaction.
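The PyTorch sketch below shows one shape-compatible way to realize the three embedding branches and the channel-wise concatenation. Layer widths are illustrative, the WTSM temporal-perturbation step is omitted, and a single subregion encoder instance is reused for both eyes and the mouth for brevity; it should not be read as the exact published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fnn

L, d = 300, 32   # window length and per-branch feature dimension (illustrative)

class SubregionEmbedding(nn.Module):
    """Encodes an eye/mouth image sequence (L, 3, H, W) into (L, d)."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, d)

    def forward(self, x):                         # x: (L, 3, H, W)
        return self.fc(self.conv(x).flatten(1))   # -> (L, d)

class LandmarkEmbedding(nn.Module):
    """Encodes 106 facial landmarks per frame, (L, 106, 2) -> (L, d)."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(1), nn.Linear(106 * 2, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x):                         # x: (L, 106, 2)
        return self.net(x)

class STMapEmbedding(nn.Module):
    """Encodes an STMap (3, H, L) into a per-frame sequence (L, d)."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU(),
            nn.Conv2d(d, d, 3, padding=1), nn.BatchNorm2d(d), nn.ReLU())

    def forward(self, x):                         # x: (3, H, L)
        z = self.conv(x.unsqueeze(0))             # (1, d, H, L)
        z = Fnn.adaptive_avg_pool2d(z, (1, L))    # pool away H, keep the L time steps
        return z.squeeze(0).squeeze(1).permute(1, 0)   # -> (L, d)

# Concatenate the five aligned streams along the channel dimension: F in R^{L x 5d}
leye, reye = torch.rand(L, 3, 25, 25), torch.rand(L, 3, 25, 25)
mouth, lmk, stmap = torch.rand(L, 3, 15, 35), torch.rand(L, 106, 2), torch.rand(3, 64, L)
sub, lme, ste = SubregionEmbedding(d), LandmarkEmbedding(d), STMapEmbedding(d)
F = torch.cat([sub(leye), sub(reye), sub(mouth), lme(lmk), ste(stmap)], dim=-1)  # (300, 160)
```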

3.4. Bidirectional Feature Interaction

3.4.1. Mamba Encoder

In order to achieve effective coordination and semantic fusion between the temporal features of different facial regions, we introduced the structured state space model Mamba in the modeling stage after feature extraction to construct a bidirectional feature interaction mechanism (Mamba encoder in Figure 3). This module receives the joint feature sequence $F \in \mathbb{R}^{L \times D}$ output from the feature extraction and alignment stages, where $L$ is the length of the time dimension and $D$ is the total feature dimension after concatenation. Considering that the temporal dynamics of different facial regions exhibit bidirectional dependencies (i.e., the physiological state at the current time is influenced by both preceding and subsequent frames), it is necessary to construct a bidirectional sequence modeling structure. We process the input features $F$ through two symmetric Mamba modules for forward and backward modeling, respectively, to obtain dynamic representations in both directions.
Let $H_{\text{forward}}$ and $H_{\text{backward}}$ denote the outputs of the forward and backward encoders, respectively; the Mamba module models them as state space sequence recursions:
$H_{\text{forward}} = \mathrm{Mamba}_{\text{forward}}(F), \qquad H_{\text{backward}} = \mathrm{Mamba}_{\text{backward}}(F)$
In the specific implementation, the two Mamba modules share the same structure but not the same parameters and take the original sequence and its time-reversed form as inputs, respectively. In each direction, state space modeling essentially performs a series of linear state updates with a recursive structure, the core mechanism of which can be expressed in the following form:
$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t + D x_t$
To obtain a unified output dimension, the outputs in both directions are kept consistent in the time dimension, each with shape $\mathbb{R}^{L \times D}$. Next, $H_{\text{forward}}$ and $H_{\text{backward}}$ are separately input into a linear mapping layer and converted into embedding representations of a unified dimension:
$Z_{\text{forward}} = \mathrm{Linear}(H_{\text{forward}}), \qquad Z_{\text{backward}} = \mathrm{Linear}(H_{\text{backward}})$
The purpose of performing linear mapping is to compress and regularize the temporal dynamic features after interaction, enabling them to retain sufficient information while achieving stronger generalization capabilities. Finally, the outputs from both directions are concatenated to form the fused interaction feature sequence:
$H_{\text{bi}} = \left[ Z_{\text{forward}} \,\|\, Z_{\text{backward}} \right] \in \mathbb{R}^{L \times 2D}$
The bidirectional feature interaction module we have created not only enhances the model’s ability to perceive long-term dependent information across different regions of the face but also improves the consistency of cross-modal temporal dynamics through a shared modeling mechanism. Its output serves as input for the subsequent feature fusion stage, continuing the representation construction process in multi-task prediction.
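A compact sketch of this bidirectional interaction is given below. It assumes the reference Mamba block from the open-source mamba_ssm package (any drop-in SSM block with a (batch, length, dim) interface would work) and stacks three layers per direction, matching the configuration found best in Section 4.6; dimensions and the projection size are illustrative.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba   # assumed dependency: reference Mamba block, I/O shape (B, L, D)

class BidirectionalMambaEncoder(nn.Module):
    """Forward and backward Mamba branches with separate parameters, followed by
    linear projections and channel-wise concatenation (sketch of Section 3.4.1)."""
    def __init__(self, dim: int, n_layers: int = 3):
        super().__init__()
        self.fwd = nn.ModuleList([Mamba(d_model=dim) for _ in range(n_layers)])
        self.bwd = nn.ModuleList([Mamba(d_model=dim) for _ in range(n_layers)])
        self.proj_f = nn.Linear(dim, dim)
        self.proj_b = nn.Linear(dim, dim)

    def forward(self, F: torch.Tensor) -> torch.Tensor:    # F: (B, L, D)
        h_f, h_b = F, torch.flip(F, dims=[1])               # backward branch sees reversed time
        for blk_f, blk_b in zip(self.fwd, self.bwd):
            h_f, h_b = blk_f(h_f), blk_b(h_b)
        h_b = torch.flip(h_b, dims=[1])                      # re-align backward output in time
        z_f, z_b = self.proj_f(h_f), self.proj_b(h_b)
        return torch.cat([z_f, z_b], dim=-1)                 # H_bi: (B, L, 2D)

# Usage with the concatenated feature sequence F from Section 3.3, e.g. (batch, 300, 160):
# H_bi = BidirectionalMambaEncoder(dim=160)(F)
```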

3.4.2. Bidirectional Feature Fusion

Following the bidirectional feature interaction process, the model obtains the feature sequence $H_{\text{bi}} \in \mathbb{R}^{L \times 2D}$ output by the Mamba encoder. This sequence simultaneously encodes information from both forward and backward temporal modeling and includes dynamic dependencies among the five categories of facial region features. We have constructed an integrated feature fusion module that combines prediction and fusion to further integrate these bidirectional temporal features and perform multi-task prediction.
The objective of the fusion stage is to compress long-term temporal information and extract high-semantic global representations. Therefore, we first perform pooling operations on H bi along the temporal dimension. Let the pooling operation be denoted by the function ϕ ( · ) . Typically, average pooling or weighted attention pooling can be selected, but in this study, we adopt average pooling along the temporal dimension to compress the sequence into a fixed-length representation:
$h_{\text{fused}} = \phi(H_{\text{bi}}) = \frac{1}{L} \sum_{t=1}^{L} H_{\text{bi}}(t) \in \mathbb{R}^{2D}$
The global vector h fused here encapsulates the dynamic information from the entire time period and is transmitted to the task prediction head in a unified feature space.
Upon entering the prediction phase, we constructed a multi-branch MLP module with a shared input, corresponding to the outputs of the three tasks: HR, RR, and cognitive load. Among these, HR and RR are regression tasks with scalar real-valued outputs, while cognitive load is a binary classification task with outputs being probabilities between 0 and 1. Specifically, let the final fused representation be $h_{\text{fused}} \in \mathbb{R}^{2D}$. The prediction functions for the three tasks can be formally expressed as
$\hat{y}_{\text{hr}} = f_{\text{hr}}(h_{\text{fused}}), \qquad \hat{y}_{\text{rr}} = f_{\text{rr}}(h_{\text{fused}}), \qquad \hat{y}_{\text{cog}} = \sigma\!\left( f_{\text{cog}}(h_{\text{fused}}) \right)$
Here, $f_{\cdot}(\cdot)$ denotes three independent MLP mappings, and $\sigma(\cdot)$ is the sigmoid activation function, used to normalize the output to a classification probability. The entire prediction structure adopts an end-to-end training approach, with the three tasks sharing the underlying encoder and bidirectional fusion mechanism and branching only at the prediction layer. This structure significantly enhances parameter sharing and information complementarity in multi-task modeling, effectively suppressing task interference while maintaining prediction accuracy. At this point, the model has completed the entire process from raw video segments to multi-task predictions.
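A minimal sketch of this pooling-plus-heads stage follows; the hidden width of the two-layer MLPs is an assumption rather than a value reported in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Temporal mean pooling followed by three lightweight two-layer MLP heads:
    HR and RR regression plus cognitive load binary classification (a sketch)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.hr_head, self.rr_head, self.cog_head = mlp(), mlp(), mlp()

    def forward(self, H_bi: torch.Tensor):            # H_bi: (B, L, 2D)
        h_fused = H_bi.mean(dim=1)                     # average pooling over time -> (B, 2D)
        y_hr = self.hr_head(h_fused).squeeze(-1)       # scalar HR regression
        y_rr = self.rr_head(h_fused).squeeze(-1)       # scalar RR regression
        y_cog = torch.sigmoid(self.cog_head(h_fused)).squeeze(-1)   # probability in (0, 1)
        return y_hr, y_rr, y_cog

# Example: y_hr, y_rr, y_cog = MultiTaskHead(dim=320)(torch.rand(8, 300, 320))
```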

3.5. Optimization Goal

We have developed differentiated optimization strategies based on the nature of each task’s labels and prediction forms. Within this multi-task learning framework, tasks share a common underlying feature representation structure, while independent loss function branches are set at the top level, enabling joint training through a unified optimization function.
In the cognitive load estimation task, the model must perform a binary classification of the driver's current load state. Since these labels are primarily derived from subjective questionnaire assessments, their judgment criteria are susceptible to individual subjective differences. Therefore, traditional cross-entropy loss exhibits poor robustness when faced with label bias and inconsistency. To address this, we introduce a truncated cross-entropy loss during training to suppress the dominant gradient effect of outlier samples far from the decision boundary during backpropagation, thereby enhancing the model's adaptability to uncertain labels. Assuming the sample prediction probability is $p \in [0, 1]$ and the label is $y \in \{0, 1\}$, the truncated loss can be expressed as
$\mathcal{L}_{\text{cog}} = -\log\left( \max(p_y, \epsilon) \right)$
where $p_y$ is the predicted probability of the true label, and $\epsilon$ is a lower threshold to avoid numerical instability, typically set as a small constant.
For the HR and RR regression tasks, the numerical distribution of physiological indicators fluctuates naturally, and a plain squared-error loss is sensitive to outliers, which affects model stability. To enhance robustness to marginal samples, we use the smooth $L_1$ loss as the optimization criterion, which penalizes quadratically when the error is small and linearly when the error is large, effectively balancing accuracy and stability. For any real-valued prediction $\hat{y}$ and label $y$, the loss is defined as
$\mathcal{L}_{\text{reg}} = \begin{cases} 0.5\,(\hat{y} - y)^2, & \text{if } |\hat{y} - y| < 1 \\ |\hat{y} - y| - 0.5, & \text{otherwise} \end{cases}$
These are used for HR and RR, respectively, denoted as $\mathcal{L}_{\text{hr}}$ and $\mathcal{L}_{\text{rr}}$.
During the early stages of joint training, specific tasks may introduce unnecessary interference during gradient propagation. We therefore introduce a dynamic weight adjustment mechanism to control the overall multi-task optimization. Let the current training iteration be $\mathrm{Iter}_{\text{current}}$ and the total number of iterations be $\mathrm{Iter}_{\text{total}}$; we define a time factor $t$ and an adaptive weight coefficient $\lambda$ as
$t = 2 \cdot \frac{\mathrm{Iter}_{\text{current}}}{\mathrm{Iter}_{\text{total}}}$
$\lambda = \frac{2}{1 + \exp(-10\,t)}$
This design ensures that the model primarily focuses on optimizing the main loss function during the early iteration stages, while the influence of the regularization term gradually increases in later stages. The joint loss expression is as follows:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cog}} + \lambda\, \mathcal{L}_{\text{hr}} + \lambda\, \mathcal{L}_{\text{rr}}$
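The sketch below implements the three loss terms and the ramped weight following the formulas above; the clamping constant eps and the use of torch.nn.functional.smooth_l1_loss (whose default threshold of 1 matches the piecewise definition) are implementation assumptions.

```python
import math
import torch
import torch.nn.functional as F

def truncated_ce(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Truncated cross-entropy: -log(max(p_y, eps)), limiting gradients from outlier labels."""
    p_y = torch.where(y == 1, p, 1.0 - p)              # probability assigned to the true label
    return -torch.log(torch.clamp(p_y, min=eps)).mean()

def joint_loss(y_hr, y_rr, p_cog, t_hr, t_rr, t_cog, step: int, total_steps: int):
    """L_total = L_cog + lambda * L_hr + lambda * L_rr with the sigmoid ramp on lambda."""
    l_hr = F.smooth_l1_loss(y_hr, t_hr)
    l_rr = F.smooth_l1_loss(y_rr, t_rr)
    l_cog = truncated_ce(p_cog, t_cog)
    t = 2.0 * step / total_steps
    lam = 2.0 / (1.0 + math.exp(-10.0 * t))             # grows as training progresses
    return l_cog + lam * l_hr + lam * l_rr
```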
The complete procedure is summarized in Algorithm 1.
Algorithm 1: Multimodal facial sequence modeling for HR, RR and cognitive load estimation
Input: Video samples with corresponding labels: X_leye, X_reye, X_mouth, X_facial, X_stmap, y_hr, y_rr, y_cog
Output: Predicted HR ŷ_hr, RR ŷ_rr, and cognitive load ŷ_cog
// Step 1: Feature Extraction
F_leye ← SubregionEmbedding(X_leye); F_reye ← SubregionEmbedding(X_reye); F_mouth ← SubregionEmbedding(X_mouth); F_facial ← LandmarkEmbedding(X_facial); F_stmap ← STMapEmbedding(X_stmap);
// Step 2: Feature Alignment and Concatenation
Align all extracted features and concatenate them along the channel dimension: F ← Concat(F_leye, F_reye, F_mouth, F_facial, F_stmap);
// Step 3: Bidirectional Feature Interaction via Mamba
H_forward ← Mamba_forward(F); H_backward ← Mamba_backward(F); Z_forward ← Linear(H_forward); Z_backward ← Linear(H_backward); H_bi ← Concat(Z_forward, Z_backward);
// Step 4: Temporal Aggregation
h_fused ← MeanPool(H_bi);
// Step 5: Multi-task Prediction
ŷ_hr ← f_hr(h_fused); ŷ_rr ← f_rr(h_fused); ŷ_cog ← σ(f_cog(h_fused));
// Step 6: Compute Loss and Backpropagate
L_hr ← SmoothL1(ŷ_hr, y_hr); L_rr ← SmoothL1(ŷ_rr, y_rr); L_cog ← TruncatedCE(ŷ_cog, y_cog); compute λ; L_total ← L_cog + λ·L_hr + λ·L_rr; backpropagate and update parameters.

4. Experiment

4.1. Datasets and Evaluation Metrics

This study selected two multimodal driving datasets, eDream [64] and MCDD [29], to evaluate the proposed method’s multi-task prediction capabilities for physiological and cognitive loads in natural driving scenarios. The eDream [64] dataset was collected in Canada using the NADS miniSim fixed-base driving simulator and covers 36 participants under the age of 35 (gender-balanced and ethnically balanced). Video was captured using multi-angle cameras (GoPro cameras at 29.97 Hz, Logitech cameras at 30 Hz, and an eye tracker at 60 Hz); physiological signals (ECG and RESP) were sampled at 240 Hz using Becker Meditec sensors, and cognitive load was assessed using the NASA-TLX questionnaire [65]. Scores below 30 were considered low load, and scores above 60 were considered high load.
The MCDD [29] dataset was collected in China using the Silab 7.1 system to simulate a driving environment; it includes 42 participants with an average age of 35.28 years (range 23–53 years), primarily of East Asian ethnicity. Video capture was performed using the Orbbec Gemini Pro (640 × 480 resolution, 30 Hz frame rate), and physiological signals were uniformly collected at 100 Hz using Ergoneers devices. Cognitive load was assessed using the NASA-TLX questionnaire; trials with normalized scores above 10 that were accompanied by non-driving tasks were classified as high load. The dataset included various task settings, with each participant completing 21 driving trials. All video frames were uniformly interpolated to 30 Hz, and physiological signals were synchronously resampled. It is worth noting that both datasets used in this study are publicly available. They were collected after ethical review and approval by the corresponding research teams, and we signed data sharing agreements with each team separately.
Following previous research methods [23,66], we used accuracy, F1 score, sensitivity, and specificity to evaluate the performance of cognitive load estimation. We employed mean absolute error (MAE), root mean square error (RMSE), and Pearson correlation coefficient (P) to assess the model’s performance in predicting HR and RR. Moreover, for each dataset, all subjects were randomly divided into training, validation, and test sets in a 6:2:2 ratio. Model training was conducted on the training set, hyperparameters were tuned using the validation set, and final performance was evaluated exclusively on the test set.
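For reference, a minimal NumPy sketch of the three regression metrics is given below; classification metrics such as accuracy and F1 score follow their standard definitions and are omitted.

```python
import numpy as np

def regression_metrics(pred: np.ndarray, target: np.ndarray):
    """MAE, RMSE, and Pearson correlation coefficient (P) used for HR/RR evaluation."""
    err = pred - target
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    p = float(np.corrcoef(pred, target)[0, 1])
    return mae, rmse, p
```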

4.2. Baselines

To comprehensively evaluate the performance of the proposed CogMamba model in multi-task estimation of cognitive load and physiological parameters, we compared it with several mainstream benchmark methods, including traditional methods, single-task deep learning methods, and multi-task learning models. We reproduced these models based on their source papers and evaluated them on our two datasets to obtain the results.
For cognitive load estimation, we selected several common architectures as single-task models, including CNN [17] based on eye region inputs, LSTM [17], CNN + SVM [17], and AE + SVM [17], representing typical approaches from feature extraction to sequence modeling. Additionally, CLERA [51], a structure specifically designed for eye feature modeling, was also included in the comparison. To analyze the impact of raw video modeling capabilities, we also introduced ResNet3D [67], ViViT [68], and VideoMamba [69] models that take full-face videos as input [33]. VDMoE [29] was the only multi-task model participating in the cognitive load estimation task [23].
In terms of physiological parameter (HR and RR) estimation, we first considered three classic traditional methods: CHROM [70], POS [71], and ARM-RR [72]. These methods are all based on face videos and employ explicit color space transformations or signal processing strategies. Second, four representative single-task HR estimation methods based on deep learning were included: Dual-GAN [73], ConDiff-rPPG [30], HSRD [33], and DG-rPPG [34]. All of these methods use STMap as input and demonstrate varying degrees of performance improvement in HR monitoring. For multi-task estimation of HR and RR, four currently representative deep learning models were selected: MTTS-CAN [74], BigSmall [75], MultiPhys [76], and PhysMLE [22]. The first two use raw video as input, while the latter two adopt the STMap structure, covering various modeling paradigms. VDMoE [29], as a current representative of multi-task fusion structures, provides important references for the comparative analysis of CogMamba through its structural design and loss branch partitioning. Relevant comparison results are detailed in Table 2, Table 3 and Table 4.

4.3. Implementation Details

This research model is implemented using the PyTorch (version 2.2.0) framework, and all experiments were conducted on a server equipped with an NVIDIA RTX A6000 graphics card. To extract facial temporal features, facial landmarks are first obtained from the input video and subregion images, and the STMap is constructed based on this to enhance the expression of spatio-temporal information. During the training phase, the Adam optimizer is used, with an initial learning rate of 0.00001, a batch size of 250, and a total number of iterations set to 20,000 to ensure that the model converges adequately under multi-task objectives.
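The reported optimizer settings translate directly into PyTorch as follows; the model object here is only a stand-in placeholder, since the full CogMamba definition is sketched piecewise in Section 3.

```python
import torch
import torch.nn as nn

model = nn.Linear(160, 3)                                    # placeholder standing in for CogMamba
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)    # Adam with initial learning rate 0.00001
batch_size, total_iters = 250, 20_000                        # values reported in Section 4.3
```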

4.4. Results of the Comparison Experiment

First, in terms of cognitive load estimation, CogMamba achieved an accuracy of 83.56% on the eDream [64] dataset, representing relative improvements of 28.10% and 25.01% over single-task baselines such as CNN [17] (65.23%) and LSTM [17] (66.84%), respectively. Compared to the deep learning baseline CLERA [51] (75.82%), there was a 10.21% improvement, and compared to the current representative multi-task model VDMoE [29] (79.89%), there was nearly a 4.6% improvement. Similarly, on the MCDD [29] dataset, CogMamba achieved an accuracy of 81.22%, with superior performance in F1 score (71.62%) and sensitivity (79.33%). These results demonstrate that CogMamba not only possesses stronger feature representation capabilities than traditional machine learning models but also outperforms other deep structures owing to the effective modeling of temporal dependencies through the introduction of the Mamba module. Additionally, the multi-task parallel optimization strategy better captures the synergistic associations between cognitive and physiological tasks compared to traditional single-task training.
Furthermore, in terms of physiological signal estimation, CogMamba achieved an HR prediction error (MAE) of 8.97 bpm on the eDream [64] dataset, outperforming all baseline models, with a 24.69% reduction compared to the best multi-task baseline PhysMLE [22] (11.91 bpm). In RR prediction, CogMamba's MAE is 2.20 rpm, outperforming PhysMLE [22] (3.22 rpm) by approximately 31.68%, while its Pearson correlation coefficient is also the highest currently reported (0.39). On the MCDD [29] dataset, CogMamba continues to lead, with an HR prediction MAE of 10.04 bpm, which is 18.36% lower than PhysMLE [22] (12.03 bpm). In RR prediction, its MAE is 4.18 rpm, which is also 18.7% lower than PhysMLE [22] (5.12 rpm).
From a methodological classification perspective, traditional methods (such as CHROM [70] and POS [71]) perform significantly worse in both tasks, with HR prediction errors generally above 17 bpm. HR prediction errors for deep learning methods, especially those using STMap input (such as ConDiff-rPPG and HSRD), are reduced to approximately 13–15 bpm. However, these methods are designed for a single physiological parameter and lack multi-task generalization capabilities. In contrast, multi-task methods have the ability to learn multiple objectives simultaneously. Among them, CogMamba effectively models long-term temporal dependencies using the Mamba structure, achieving simultaneous improvements in accuracy and efficiency under limited resource constraints.

4.5. Results of Ablation Study

We designed multiple ablation experiments to further validate the effectiveness of each submodule in the CogMamba model and compared the results with the complete model, as shown in Table 5 and Table 6. Specifically, we constructed five model variants by removing STMap, facial landmarks, the left eye, the right eye, and the mouth region, respectively. We also compared and analyzed ViViT [68] and ResNet3D [67] as typical video input multi-task model baselines. The experimental results from the eDream [64] and MCDD [29] datasets show that the complete CogMamba achieves the best performance in all three tasks: cognitive load estimation, HR, and RR. In contrast, the removal of each module has varying degrees of impact on performance.
First, when STMap is removed, the model experiences a significant performance decline in physiological signal estimation. On the eDream [64] dataset, the MAE for HR increases from 8.97 to 15.12 (a 68.56% increase), and the MAE for RR increases from 2.20 to 6.41 (a 191.36% increase); on the MCDD [29] dataset, the MAE for HR increases to 16.22, the MAE for RR increases to 8.31, and the Pearson correlation coefficient (P) also decreases significantly. This indicates that the periodic color changes in the facial region captured by STMap are crucial for rPPG modeling, as they explicitly reflect the temporal patterns of heart rate and respiration-related signals [77]. Therefore, relying solely on subregions and facial landmarks is insufficient to recover the fine-grained rhythmic features in physiological signals. Additionally, when the facial landmarks module is removed, the model's performance degrades across all three tasks, particularly in cognitive load estimation accuracy, which decreases from 83.56% to 81.23%, while the MAE for HR and RR estimation also slightly increases. This change indicates that facial landmarks provide macro-level facial movements (such as frowning or opening the mouth) and expression information, which are highly correlated with cognitive load during driving [58], while also aiding the structural alignment of the STMap region and enhancing representational consistency.
Furthermore, after removing the subregion inputs, the performance of cognitive load estimation significantly decreases, with accuracy rates on the eDream [64] dataset dropping to 70.54% and 75.54%, respectively, and F1 scores decreasing by over 4%. This indicates that the eye and mouth regions are critical for cognitive load recognition, particularly the eye region, which is closely related to cognitive load [78]. Additionally, the loss of these regions has a relatively minor impact on HR and RR estimation, suggesting their role is more oriented toward cognitive modeling. At the same time, comparisons of the Pearson correlation coefficient (P) across multiple variants show that the complete model maintains the optimal values for HR and RR (0.66 and 0.39, respectively), further validating the complementary roles of each submodule in temporal consistency modeling. Thus, each submodule in CogMamba plays a distinct and irreplaceable role. STMap serves as the foundation for physiological modeling; facial landmarks aid in structural alignment and emotional expression; and the eye and mouth subregions enhance the modeling of cognitive cues. We further investigated the performance of the model when only one feature was retained and visualized the results in Figure 4 and Figure 5. The integration of these designs enables CogMamba to balance accuracy, robustness, and collaboration across multiple tasks.

4.6. Impact of the Number of Mamba Layers

To assess the specific impact of the number of Mamba layers on the performance of the CogMamba model, we systematically tested the performance of Mamba structures with one to six layers in multi-task estimation on the eDream [64] and MCDD [29] datasets. The tasks included cognitive load classification, HR, and RR regression. The experimental results are shown in Figure 6, illustrating the trends in three types of metrics across different numbers of layers.
In terms of cognitive load accuracy, the curves for both datasets show consistent trends: as the number of Mamba layers increases from 1 to 3, accuracy significantly improves, with eDream [64] rising from 78.0% to 83.5% and MCDD [29] from 75.0% to 81.2%, representing increases of 5.5% and 6.2%, respectively. Beyond three layers, accuracy slowly decreases, with the performance at six layers being roughly equivalent to that at two layers. This indicates that stacking an appropriate number of layers can effectively enhance the model's ability to capture temporal features of driving states, while too many layers may introduce overfitting risks, leading to a decline in generalization performance. For the estimation of HR and RR, both datasets achieved optimal results with a three-layer Mamba structure. In eDream [64], the MAE for HR was 8.97 bpm and the MAE for RR was 2.20 rpm; in MCDD [29], the MAE for HR was 10.04 bpm, and the MAE for RR was 4.18 rpm. The overall trend is similar to that of cognitive load: MAE decreases rapidly between one and three layers but increases slowly beyond three layers. This phenomenon indicates that a multi-layer Mamba structure helps the model extract more complex temporal–spatial periodic signal features, but excessive stacking may introduce redundant modeling or even noise-fitting issues.
This phenomenon can be explained by changes in the learning ability of the sequence modeling mechanism. As the number of layers increases, the Mamba module gains the ability to extract multi-level representations from the original signal, thereby constructing more complex nonlinear function mapping relationships, which facilitates multimodal fusion [79]. However, when stacked too deeply, the model may lose its generalization ability. In addition, too many layers can cause training instability, such as gradient vanishing or explosion [80]. In summary, the three-layer Mamba structure achieves a good balance between accuracy and stability, making it the optimal configuration selected in this study.

4.7. Cross-Dataset Estimation

We conducted cross-dataset experiments on the eDream [64] and MCDD [29] datasets to evaluate the generalization ability of the CogMamba model. By training the model on one dataset and testing it on another, we verified the model’s performance under unconstrained conditions in unknown scenarios.
As shown in Table 7, when CogMamba is trained on the eDream [64] dataset and tested on the MCDD [29] dataset, the model maintains good performance across different populations and recording conditions. CogMamba achieves 68.49% accuracy in the cognitive load classification task, outperforming VDMoE [29] (65.97%), ViViT [68] (58.92%), and ResNet3D [67] (57.03%). For HR and RR regression estimates, CogMamba’s HR MAE is 12.66 bpm, slightly higher than VDMoE [29] but significantly better than baselines such as ViViT [68] (16.26 bpm), ResNet3D [67] (17.07 bpm), and MultiPhys [76] (16.26 bpm). The RR MAE is 3.88 rpm, which is not the best but still belongs to the top tier of performance. Similarly, as shown in Table 8, when CogMamba is trained on the MCDD [29] dataset and tested on the eDream [64] dataset, the model maintains its high-quality performance. In the cognitive load classification task, the accuracy reached 66.85%, higher than VDMoE [29] (65.97%), ViViT [68] (56.23%), and ResNet3D [67] (55.43%). The HR MAE was 16.86 bpm, which is at an intermediate level. The RR MAE was 5.72 rpm, slightly higher than VDMoE [29] (5.20 rpm) and DG-rPPG [34] (5.43 rpm), but significantly better than ResNet3D [67] (7.32 rpm) and ViViT [68] (6.93 rpm). Compared to single-task deep models like HSRD [33] and DG-rPPG [34], or multi-task models like PhysMLE [22] and MultiPhys [76] that process STMap inputs, CogMamba generally maintains the same or better performance, demonstrating its strong generalization capabilities.

4.8. Results of the Computational Cost Study

Table 9 presents the computational cost comparison in terms of the number of parameters, floating-point operations (FLOPs), and inference time for different models. Among all compared methods, CogMamba achieves the lowest computational overhead, with only 5.95 million parameters, 3.17 GFLOPs, and an inference time of 1.82 ms. This is substantially lower than heavy architectures such as ViViT [68] (87.35 M parameters, 283.06 GFLOPs, 29.81 ms) and ResNet3D [67] (33.37 M parameters, 40.70 GFLOPs, 5.57 ms). Even compared with other lightweight approaches such as VDMoE [29] (7.34 M parameters, 4.05 GFLOPs, 1.91 ms) and HSRD [33] (12.16 M parameters, 4.94 GFLOPs, 1.99 ms), CogMamba still exhibits clear advantages. The combination of a small parameter count, low FLOPs, and short inference time demonstrates the model’s suitability for real-time, resource-constrained deployment scenarios, such as in-vehicle driver monitoring systems, where both efficiency and responsiveness are critical.

4.9. Case Study

To further investigate the impact of facial movements on the model’s prediction performance, a representative sample from the dataset was selected for testing. As shown in Figure 7, the HR and RR predictions in the gray area exhibit significant deviations when the face is moving and not directly facing the camera, indicating that the model’s ability to extract physiological features decreases during facial movements. This finding highlights a significant challenge in remote cognitive load monitoring—facial actions can impair the accuracy of physiological signal estimation, thus affecting the overall reliability of the system.

5. Conclusions

In this study, we propose a multi-task non-contact cognitive load and physiological state estimation model based on RGB video, named CogMamba. Our model combines the advantages of the Mamba architecture while incorporating multimodal facial features through the heterogeneous embedding network. Specifically, compared to traditional deep learning architectures like CNNs and Transformers, our model achieves the ability to capture global context and long-range dependencies while reducing computational complexity, enabling more efficient extraction of shared features. In addition, we extract key regions and facial landmarks from the driver’s face in the video and obtain the driver’s STMap using rPPG technology. By using these three elements as multimodal inputs, the model can focus on truly useful core facial features, thereby reducing the model’s dependence on local non-critical inputs and improving the model’s performance and robustness. To evaluate the feasibility of our method, we tested the model on two public datasets, and the results indicate that CogMamba outperforms other previous models. Therefore, our model has the potential to be deployed in an in-vehicle environment. Meanwhile, since we use multimodal inputs rather than directly inputting video frames into the model, we can effectively protect user privacy. Nevertheless, our model still has room for improvement, particularly in the generalization in cross-dataset testing. In the future, we should train the model on larger datasets and conduct larger-scale generalization tests. Additionally, considering the cost of training the model [32], we need to take into account real-world practical needs and develop more efficient and advanced models.

Author Contributions

Methodology, Y.X.; Resources, B.G.; Software, Y.X.; Supervision, B.G.; Visualization, Y.X.; Writing—original draft, Y.X.; Writing—review and editing, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The eDream dataset used in this study can be accessed at https://www.dsp.utoronto.ca/projects/eDREAM/ (accessed on 12 March 2025). For the MCDD dataset, please contact the author of its source paper at jwanggo@connect.ust.hk.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NDRT    Non-driving-related Task
HR    Heart Rate
RR    Respiratory Rate
SAE    The Society of Automotive Engineers
TORs    Take-over Requests
EEG    Electroencephalography
ML    Machine Learning
rPPG    Remote Photoplethysmography
STMap    Spatial-temporal Map
SSMs    State Space Models
MLP    Multilayer Perceptron
ECG    Electrocardiography
EDA    Electrodermal Activity
SVM    Support Vector Machines
RF    Random Forests
CNN    Convolutional Neural Networks
RNN    Recurrent Neural Networks
BSSM    Bidirectional State Space Models
LTI    Linear Time-invariant
ZOH    Zero-order Hold
MAE    Mean Absolute Error
RMSE    Root Mean Square Error

References

  1. Tandrayen-Ragoobur, V. The economic burden of road traffic accidents and injuries: A small island perspective. Int. J. Transp. Sci. Technol. 2025, 17, 109–119. [Google Scholar] [CrossRef]
  2. Cruz, O.G.D.; Padilla, J.A.; Victoria, A.N. Managing road traffic accidents: A review on its contributing factors. IOP Conf. Ser. Earth Environ. Sci. 2021, 822, 012015. [Google Scholar] [CrossRef]
  3. Lee, J.D. Perspectives on automotive automation and autonomy. J. Cogn. Eng. Decis. Mak. 2018, 12, 53–57. [Google Scholar] [CrossRef]
  4. Gresset, C.; Morda, D. Assessing the Human Barriers and Impact of Autonomous Driving in Transportation Activities: A Multiple Case Study. 2021. Available online: https://www.diva-portal.org/smash/get/diva2:1560021/FULLTEXT01.pdf (accessed on 1 September 2025).
  5. Häuslschmid, R.; Pfleging, B.; Butz, A. The influence of non-driving-related activities on the driver’s resources and performance. In Automotive User Interfaces: Creating Interactive Experiences in the Car; Springer International Publishing: Cham, Switzerland, 2017; pp. 215–247. [Google Scholar]
  6. Wang, J.; Wang, A.; Yan, S.; He, D.; Wu, K. Revisiting Interactions of Multiple Driver States in Heterogenous Population and Cognitive Tasks. arXiv 2024, arXiv:2412.13574. [Google Scholar] [CrossRef]
  7. Wang, A.; Wang, J.; Huang, C.; He, D.; Yang, H. Exploring how physio-psychological states affect drivers’ takeover performance in conditional automated vehicles. Accid. Anal. Prev. 2025, 216, 108022. [Google Scholar] [CrossRef]
  8. Dixit, V.V.; Chand, S.; Nair, D.J. Autonomous vehicles: Disengagements, accidents and reaction times. PLoS ONE 2016, 11, e0168054. [Google Scholar] [CrossRef]
  9. Huang, C.; Wang, J.; Wang, A.; Huang, Q.; He, D. The Effect of Advanced Driver Assistance Systems on Truck Drivers’ Defensive Driving Behaviors: Insights from a Preliminary On-Road Study. Int. J. Hum.-Comput. Interact. 2025, 1–15. [Google Scholar] [CrossRef]
  10. Wei, D.; Zhang, C.; Fan, M.; Ge, S.; Mi, Z. Research on multimodal adaptive in-vehicle interface interaction design strategies for hearing-impaired drivers in fatigue driving scenarios. Sustainability 2024, 16, 10984. [Google Scholar] [CrossRef]
  11. Wang, A.; Huang, C.; Wang, J.; He, D. The association between physiological and eye-tracking metrics and cognitive load in drivers: A meta-analysis. Transp. Res. Part F Traffic Psychol. Behav. 2024, 104, 474–487. [Google Scholar] [CrossRef]
  12. Vanneste, P.; Raes, A.; Morton, J.; Bombeke, K.; Van Acker, B.B.; Larmuseau, C.; Depaepe, F.; Van den Noortgate, W. Towards measuring cognitive load through multimodal physiological data. Cogn. Technol. Work 2021, 23, 567–585. [Google Scholar] [CrossRef]
  13. Wang, A.; Wang, J.; Shi, W.; He, D. Cognitive Workload Estimation in Conditionally Automated Vehicles Using Transformer Networks Based on Physiological Signals. Transp. Res. Rec. 2024, 2678, 1183–1196. [Google Scholar]
  14. Yedukondalu, J.; Sunkara, K.; Radhika, V.; Kondaveeti, S.; Anumothu, M.; Murali Krishna, Y. Cognitive load detection through EEG lead wise feature optimization and ensemble classification. Sci. Rep. 2025, 15, 842. [Google Scholar] [CrossRef]
  15. Anwar, U.; Arslan, T.; Hussain, A. Hearing Loss, Cognitive Load and Dementia: An Overview of Interrelation, Detection and Monitoring Challenges with Wearable Non-invasive Microwave Sensors. arXiv 2022, arXiv:2202.03973. [Google Scholar] [CrossRef]
  16. Razak, S.F.A.; Yogarayan, S.; Aziz, A.A.; Abdullah, M.F.A.; Kamis, N.H. Physiological-based driver monitoring systems: A scoping review. Civ. Eng. J. 2022, 8, 3952–3967. [Google Scholar] [CrossRef]
  17. Rahman, H.; Ahmed, M.U.; Barua, S.; Funk, P.; Begum, S. Vision-based driver’s cognitive load classification considering eye movement using machine learning and deep learning. Sensors 2021, 21, 8019. [Google Scholar] [CrossRef]
  18. Misra, A.; Samuel, S.; Cao, S.; Shariatmadari, K. Detection of driver cognitive distraction using machine learning methods. IEEE Access 2023, 11, 18000–18012. [Google Scholar] [CrossRef]
  19. Lu, H.; Niu, X.; Wang, J.; Wang, Y.; Hu, Q.; Tang, J.; Zhang, Y.; Yuan, K.; Huang, B.; Yu, Z.; et al. Gpt as psychologist? Preliminary evaluations for gpt-4v on visual affective computing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 322–331. [Google Scholar]
  20. Alshanskaia, E.I.; Zhozhikashvili, N.A.; Polikanova, I.S.; Martynova, O.V. Heart rate response to cognitive load as a marker of depression and increased anxiety. Front. Psychiatry 2024, 15, 1355846. [Google Scholar] [CrossRef]
  21. Ayres, P.; Lee, J.Y.; Paas, F.; Van Merrienboer, J.J. The validity of physiological measures to identify differences in intrinsic cognitive load. Front. Psychol. 2021, 12, 702538. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, J.; Lu, H.; Wang, A.; Yang, X.; Chen, Y.; He, D.; Wu, K. Physmle: Generalizable and priors-inclusive multi-task remote physiological measurement. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4908–4925. [Google Scholar]
  23. Wang, J.; Wang, A.; Hu, H.; Wu, K.; He, D. Multi-Source Domain Generalization for ECG-Based Cognitive Load Estimation: Adversarial Invariant and Plausible Uncertainty Learning. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1631–1635. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
  25. Larraga-García, B.; Bejerano, V.R.; Oregui, X.; Rubio-Bolívar, J.; Quintana-Díaz, M.; Gutiérrez, Á. Physiological and performance metrics during a cardiopulmonary real-time feedback simulation to estimate cognitive load. Displays 2024, 84, 102780. [Google Scholar] [CrossRef]
  26. Zhou, W.; Zhao, L.; Zhang, R.; Cui, Y.; Huang, H.; Qie, K.; Wang, C. Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey. arXiv 2024, arXiv:2412.00348. [Google Scholar] [CrossRef]
  27. Doudou, M.; Bouabdallah, A.; Berge-Cherfaoui, V. Driver drowsiness measurement technologies: Current research, market solutions, and challenges. Int. J. Intell. Transp. Syst. Res. 2020, 18, 297–319. [Google Scholar] [CrossRef]
  28. Yang, H.; Wu, J.; Hu, Z.; Lv, C. Real-time driver cognitive workload recognition: Attention-enabled learning with multimodal information fusion. IEEE Trans. Ind. Electron. 2023, 71, 4999–5009. [Google Scholar] [CrossRef]
  29. Wang, J.; Yang, X.; Wang, Z.; Wei, X.; Wang, A.; He, D.; Wu, K. Efficient mixture-of-expert for video-based driver state and physiological multi-task estimation in conditional autonomous driving. arXiv 2024, arXiv:2410.21086. [Google Scholar]
  30. Wang, J.; Wei, X.; Lu, H.; Chen, Y.; He, D. Condiff-rppg: Robust remote physiological measurement to heterogeneous occlusions. IEEE J. Biomed. Health Inform. 2024, 28, 7090–7102. [Google Scholar] [CrossRef] [PubMed]
  31. Lu, H.; Tang, J.; Wang, J.; Lu, Y.; Cao, X.; Hu, Q.; Wang, Y.; Zhang, Y.; Xie, T.; Zhang, Y.; et al. Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot. arXiv 2025, arXiv:2505.10257. [Google Scholar] [CrossRef]
  32. Wang, J.; Yang, X.; Hu, Q.; Tang, J.; Liu, C.; He, D.; Wang, Y.; Chen, Y.; Wu, K. PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring. arXiv 2025, arXiv:2507.19172. [Google Scholar]
  33. Wang, J.; Lu, H.; Wang, A.; Chen, Y.; He, D. Hierarchical style-aware domain generalization for remote physiological measurement. IEEE J. Biomed. Health Inform. 2023, 28, 1635–1643. [Google Scholar] [CrossRef]
  34. Wang, J.; Lu, H.; Han, H.; Chen, Y.; He, D.; Wu, K. Generalizable Remote Physiological Measurement via Semantic-Sheltered Alignment and Plausible Style Randomization. IEEE Trans. Instrum. Meas. 2024, 74, 5003014. [Google Scholar] [CrossRef]
  35. Wang, C.; Tsepa, O.; Ma, J.; Wang, B. Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv 2024, arXiv:2402.00789. [Google Scholar]
  36. Xie, X.; Cui, Y.; Tan, T.; Zheng, X.; Yu, Z. Fusionmamba: Dynamic feature enhancement for multimodal image fusion with mamba. Vis. Intell. 2024, 2, 37. [Google Scholar] [CrossRef]
  37. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  38. Stojmenova, K.; Duh, S.E.; Sodnik, J. A review on methods for assessing driver’s cognitive load. IPSI BGD Trans. Internet Res. 2018, 14, 1–8. [Google Scholar]
  39. Reimer, B.; Mehler, B.; Coughlin, J.F.; Godfrey, K.M.; Tan, C. An on-road assessment of the impact of cognitive workload on physiological arousal in young adult drivers. In Proceedings of the 1st International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Essen, Germany, 21–22 September 2009; pp. 115–118. [Google Scholar]
  40. Zander, T.O.; Andreessen, L.M.; Berg, A.; Bleuel, M.; Pawlitzki, J.; Zawallich, L.; Krol, L.R.; Gramann, K. Evaluation of a dry EEG system for application of passive brain-computer interfaces in autonomous driving. Front. Hum. Neurosci. 2017, 11, 78. [Google Scholar] [CrossRef]
  41. Gerjets, P.; Walter, C.; Rosenstiel, W.; Bogdan, M.; Zander, T.O. Cognitive state monitoring and the design of adaptive instruction in digital environments: Lessons learned from cognitive workload assessment using a passive brain-computer interface approach. Front. Neurosci. 2014, 8, 385. [Google Scholar] [CrossRef] [PubMed]
  42. Kumar, N.; Kumar, J. Measurement of cognitive load in HCI systems using EEG power spectrum: An experimental study. Procedia Comput. Sci. 2016, 84, 70–78. [Google Scholar] [CrossRef]
  43. Ahmed, M.U.; Begum, S.; Gestlöf, R.; Rahman, H.; Sörman, J. Machine Learning for Cognitive Load Classification—A Case Study on Contact-Free Approach. In Proceedings of the Artificial Intelligence Applications and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos Marmaras, Greece, 5–7 June 2020; Proceedings, Part I 16. Springer: Cham, Switzerland, 2020; pp. 31–42. [Google Scholar]
  44. Ahmed, S.G.; Verbert, K.; Siedahmed, N.; Khalil, A.; AlJassmi, H.; Alnajjar, F. AI Innovations in rPPG Systems for Driver Monitoring: Comprehensive Systematic Review and Future Prospects. IEEE Access 2025, 13, 22893–22918. [Google Scholar] [CrossRef]
  45. Nasri, M.; Kosa, M.; Chukoskie, L.; Moghaddam, M.; Harteveld, C. Exploring Eye Tracking to Detect Cognitive Load in Complex Virtual Reality Training. In Proceedings of the 2024 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), Bellevue, WA, USA, 21–25 October 2024; pp. 51–54. [Google Scholar]
  46. Razzaq, K.; Shah, M. Machine learning and deep learning paradigms: From techniques to practical applications and research frontiers. Computers 2025, 14, 93. [Google Scholar] [CrossRef]
  47. Plazibat, I.; Gašperov, L.; Petričević, D. Nudging Technique in Retail: Increasing Consumer Consumption. JCGIRM 2021, 2, 1–9. [Google Scholar] [CrossRef]
  48. Paulchamy, B.; Yahya, A.; Chinnasamy, N.; Kasilingam, K. Facial expression recognition through transfer learning: Integration of VGG16, ResNet, and AlexNet with a multiclass classifier. Acadlore Trans. AI Mach. Learn. 2025, 4, 25–39. [Google Scholar] [CrossRef]
  49. Khan, M.A.; Asadi, H.; Qazani, M.R.C.; Lim, C.P.; Nahavandi, S. Functional near-infrared spectroscopy (fNIRS) and Eye tracking for Cognitive Load classification in a Driving Simulator Using Deep Learning. arXiv 2024, arXiv:2408.06349. [Google Scholar]
  50. Arumugam, D.; Ho, M.K.; Goodman, N.D.; Van Roy, B. Bayesian reinforcement learning with limited cognitive load. Open Mind 2024, 8, 395–438. [Google Scholar] [CrossRef]
  51. Ding, L.; Terwilliger, J.; Parab, A.; Wang, M.; Fridman, L.; Mehler, B.; Reimer, B. CLERA: A unified model for joint cognitive load and eye region analysis in the wild. ACM Trans. Comput.-Hum. Interact. 2023, 30, 1–23. [Google Scholar] [CrossRef]
  52. Yadav, S.; Tan, Z.H. Audio mamba: Selective state spaces for self-supervised audio representations. arXiv 2024, arXiv:2406.02178. [Google Scholar] [CrossRef]
  53. Zhang, X.; Zhang, Q.; Liu, H.; Xiao, T.; Qian, X.; Ahmed, B.; Ambikairajah, E.; Li, H.; Epps, J. Mamba in speech: Towards an alternative to self-attention. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 1933–1948. [Google Scholar] [CrossRef]
  54. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  55. Gholipour, Y.; Mirabdollahi Shams, E. Introduction New Combination of Zero-Order Hold and First-Order Hold. 2014. Available online: https://ssrn.com/abstract=5231692 (accessed on 30 July 2014).
  56. Reed, P.; Steed, I. The effects of concurrent cognitive task load on recognising faces displaying emotion. Acta Psychol. 2019, 193, 153–159. [Google Scholar] [CrossRef] [PubMed]
  57. Moon, J.; Ryu, J. The effects of social and cognitive cues on learning comprehension, eye-gaze pattern, and cognitive load in video instruction. J. Comput. High. Educ. 2021, 33, 39–63. [Google Scholar] [CrossRef]
  58. Fridman, L.; Reimer, B.; Mehler, B.; Freeman, W.T. Cognitive load estimation in the wild. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–9. [Google Scholar]
  59. Abilkassov, S.; Kairgaliyev, M.; Zhakanov, B.; Abibullaev, B. A System For Drivers’ Cognitive Load Estimation Based On Deep Convolutional Neural Networks and Facial Feature Analysis. In Proceedings of the 2021 22nd IEEE International Conference on Industrial Technology (ICIT), Virtual, 10–12 March 2021; Volume 1, pp. 994–1000. [Google Scholar]
  60. Zhong, R.; Zhou, Y.; Gou, C. Est-tsanet: Video-based remote heart rate measurement using temporal shift attention network and estmap. IEEE Trans. Instrum. Meas. 2023, 73, 1–14. [Google Scholar] [CrossRef]
  61. Kim, D.Y.; Cho, S.Y.; Lee, K.; Sohn, C.B. A study of projection-based attentive spatial–temporal map for remote photoplethysmography measurement. Bioengineering 2022, 9, 638. [Google Scholar] [CrossRef]
  62. Wang, J.; Yang, X.; Lu, H.; He, D.; Wu, K. Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization. arXiv 2025, arXiv:2506.16160. [Google Scholar]
  63. Yang, X.; Fan, Y.; Liu, C.; Su, H.; Guo, W.; Wang, J.; He, D. Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement. arXiv 2025, arXiv:2507.07908. [Google Scholar]
  64. Liu, C.C. Data Collection Report. 2017. Available online: https://www.dsp.toronto.edu/projects/eDREAM/publications/edream/test.pdf (accessed on 30 July 2014).
  65. Hart, S.G.; Staveland, L.E. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Adv. Psychol. 1988, 52, 139–183. [Google Scholar]
  66. Peng, Y.; Deng, H.; Xiang, G.; Wu, X.; Yu, X.; Li, Y.; Yu, T. A multi-source fusion approach for driver fatigue detection using physiological signals and facial image. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16614–16624. [Google Scholar] [CrossRef]
  67. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555. [Google Scholar]
  68. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  69. Li, K.; Li, X.; Wang, Y.; He, Y.; Wang, Y.; Wang, L.; Qiao, Y. Videomamba: State space model for efficient video understanding. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 237–255. [Google Scholar]
  70. De Haan, G.; Jeanne, V. Robust pulse rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 2013, 60, 2878–2886. [Google Scholar] [CrossRef]
  71. Wang, W.; Den Brinker, A.C.; Stuijk, S.; De Haan, G. Algorithmic principles of remote PPG. IEEE Trans. Biomed. Eng. 2016, 64, 1479–1491. [Google Scholar] [CrossRef] [PubMed]
  72. Tarassenko, L.; Villarroel, M.; Guazzi, A.; Jorge, J.; Clifton, D.; Pugh, C. Non-contact video-based vital sign monitoring using ambient light and auto-regressive models. Physiol. Meas. 2014, 35, 807. [Google Scholar] [CrossRef] [PubMed]
  73. Lu, H.; Han, H.; Zhou, S.K. Dual-GAN: Joint BVP and Noise Modeling for Remote Physiological Measurement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12404–12413. [Google Scholar]
  74. Liu, X.; Fromm, J.; Patel, S.; McDuff, D. Multi-task temporal shift attention networks for on-device contactless vitals measurement. Adv. Neural Inf. Process. Syst. 2020, 33, 19400–19411. [Google Scholar]
  75. Narayanswamy, G.; Liu, Y.; Yang, Y.; Ma, C.; Liu, X.; McDuff, D.; Patel, S. Bigsmall: Efficient multi-task learning for disparate spatial and temporal physiological measurements. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 7914–7924. [Google Scholar]
  76. Huo, C.; Yin, P.; Fu, B. MultiPhys: Heterogeneous Fusion of Mamba and Transformer for Video-Based Multi-Task Physiological Measurement. Sensors 2024, 25, 100. [Google Scholar] [CrossRef]
  77. McDuff, D.J.; Estepp, J.R.; Piasecki, A.M.; Blackford, E.B. A survey of remote optical photoplethysmographic imaging methods. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 6398–6404. [Google Scholar]
  78. Zagermann, J.; Pfeil, U.; Reiterer, H. Studying eye movements as a basis for measuring cognitive load. In Proceedings of the Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–6. [Google Scholar]
  79. Shi, L.; Zhong, B.; Liang, Q.; Hu, X.; Mo, Z.; Song, S. Mamba Adapter: Efficient Multi-Modal Fusion for Vision-Language Tracking. IEEE Trans. Circuits Syst. Video Technol. 2025; early access. [Google Scholar]
  80. Philipp, G.; Song, D.; Carbonell, J.G. The exploding gradient problem demystified-definition, prevalence, impact, origin, tradeoffs, and solutions. arXiv 2017, arXiv:1712.05577. [Google Scholar]
Figure 1. The overall architecture, including feature extraction and alignment, bidirectional feature interaction, and bidirectional feature fusion.
Figure 2. Internal structure of three types of feature embedding.
Figure 3. Mamba encoder.
Figure 4. Comparison of test results on the eDream dataset when only one feature is retained.
Figure 5. Comparison of test results on the MCDD dataset when only one feature is retained.
Figure 6. Impact of the number of Mamba layers.
Figure 7. Sampling the impact of facial movement on HR and RR prediction results.
Table 1. Summary of symbols and descriptions.
Symbol | Description
L | Length of the input time window (number of frames, e.g., 300)
D | Total concatenated feature dimension after alignment
F_leye, F_reye, F_mouth | Feature sequences extracted from the left eye, right eye, and mouth regions
F_facial | Feature sequence extracted from 106-point facial landmarks
F_stmap | Feature sequence extracted from the STMap (spatio-temporal map)
F | Concatenated multi-region feature sequence in R^{L×D}
H_forward, H_backward | Forward and backward Mamba-encoded features in R^{L×D}
Z_forward, Z_backward | Projected embeddings via linear layers after Mamba
H_bi | Concatenated bidirectional features in R^{L×2D}
h_fused | Temporally aggregated global representation via mean pooling
ŷ_hr, ŷ_rr, ŷ_cog | Predicted heart rate, respiration rate, and cognitive load
y_hr, y_rr, y_cog | Ground-truth labels for HR, RR, and cognitive load
L_hr, L_rr | Smooth L1 losses for HR and RR regression
L_cog | Truncated cross-entropy loss for cognitive load classification
λ | Adaptation weight
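The loss terms listed in Table 1 suggest a weighted multi-task objective. The snippet below is a minimal sketch assuming the Smooth L1 losses for HR and RR and the cognitive-load cross-entropy are combined as a weighted sum with the adaptation weight λ; the exact truncation of the cross-entropy and the weighting scheme follow the paper and are not reproduced here, so plain cross-entropy and a fixed weight are used as stand-ins.

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_cog, pred_hr, pred_rr, y_cog, y_hr, y_rr, lam=0.5):
    """Sketch of a combined objective: assumes a simple weighted sum with adaptation weight lam."""
    loss_hr = F.smooth_l1_loss(pred_hr, y_hr)       # L_hr
    loss_rr = F.smooth_l1_loss(pred_rr, y_rr)       # L_rr
    # Plain cross-entropy stands in for the truncated variant used in the paper.
    loss_cog = F.cross_entropy(pred_cog, y_cog)     # L_cog
    return loss_cog + lam * (loss_hr + loss_rr)

# Example: batch of 4 samples, 3 cognitive-load classes, synthetic HR/RR targets
loss = multitask_loss(torch.randn(4, 3), torch.randn(4), torch.randn(4),
                      torch.randint(0, 3, (4,)), 70 + 10 * torch.randn(4), 15 + 3 * torch.randn(4))
print(loss.item())
```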
Table 2. Cognitive load estimation performance comparison on the eDream and MCDD datasets.
Method | eDream Acc (%) | eDream F1 (%) | eDream Sens (%) | eDream Spec (%) | MCDD Acc (%) | MCDD F1 (%) | MCDD Sens (%) | MCDD Spec (%)
CNN [17] | 65.23 | 53.18 | 58.97 | 71.02 | 64.01 | 52.05 | 58.04 | 69.56
LSTM [17] | 66.84 | 54.79 | 60.98 | 72.71 | 65.62 | 53.63 | 59.95 | 71.34
CNN + SVM [17] | 68.35 | 56.17 | 62.94 | 74.28 | 67.14 | 55.02 | 61.97 | 72.73
AE + SVM [17] | 69.52 | 57.63 | 64.48 | 75.55 | 68.37 | 56.41 | 63.49 | 74.21
ResNet3D [67] | 77.18 | 65.19 | 70.96 | 81.05 | 72.04 | 61.10 | 68.33 | 73.80
ViViT [68] | 78.46 | 66.97 | 72.98 | 82.16 | 71.76 | 56.57 | 57.25 | 78.63
CLERA [51] | 75.82 | 62.95 | 68.94 | 79.53 | 75.09 | 62.07 | 68.02 | 78.96
VDMoE [29] | 79.89 | 68.96 | 86.51 | 77.59 | 79.96 | 68.81 | 77.13 | 80.77
VideoMamba [69] | 80.15 | 70.25 | 72.50 | 84.33 | 73.00 | 58.52 | 58.48 | 71.11
CogMamba | 83.56 | 74.48 | 88.25 | 82.19 | 81.22 | 71.62 | 79.33 | 84.02
Table 3. Physiological signal estimation performance on the eDream dataset.
Method | HR MAE ↓ | HR RMSE ↓ | HR p | RR MAE ↓ | RR RMSE ↓ | RR p
CHROM [70] | 20.31 | 25.71 | 0.21 | — | — | —
POS [71] | 18.53 | 22.42 | 0.25 | — | — | —
ARM-RR [72] | — | — | — | 4.24 | 5.11 | 0.26
Dual-GAN [73] | 15.85 | 18.52 | 0.47 | — | — | —
ConDiff-rPPG [30] | 15.41 | 18.72 | 0.49 | — | — | —
HSRD [33] | 14.82 | 17.91 | 0.51 | — | — | —
DG-rPPG [34] | 14.22 | 17.22 | 0.53 | — | — | —
MTTS-CAN [74] | 13.62 | 16.52 | 0.57 | 3.62 | 4.32 | 0.31
BigSmall [75] | 12.91 | 15.91 | 0.58 | 3.42 | 4.11 | 0.32
MultiPhys [76] | 12.51 | 15.41 | 0.60 | 3.32 | 4.01 | 0.32
PhysMLE [22] | 11.91 | 14.81 | 0.62 | 3.22 | 3.91 | 0.33
ResNet3D [67] | 10.50 | 14.00 | 0.65 | 2.91 | 3.82 | 0.37
ViViT [68] | 10.21 | 13.70 | 0.65 | 2.82 | 3.72 | 0.38
VDMoE [29] | 9.17 | 13.80 | 0.65 | 2.53 | 3.02 | 0.34
VideoMamba [69] | 9.83 | 12.72 | 0.66 | 2.45 | 3.57 | 0.37
CogMamba | 8.97 | 13.10 | 0.66 | 2.20 | 2.69 | 0.39
Notes: In this and the following tables, '—' means there are no evaluation results.
Table 4. Physiological signal estimation performance on the MCDD dataset.
Method | HR MAE ↓ | HR RMSE ↓ | HR p | RR MAE ↓ | RR RMSE ↓ | RR p
CHROM [70] | 18.33 | 19.70 | 0.20 | — | — | —
POS [71] | 17.33 | 20.02 | 0.22 | — | — | —
ARM-RR [72] | — | — | — | 7.32 | 9.16 | 0.11
Dual-GAN [73] | 13.29 | 18.86 | 0.31 | — | — | —
ConDiff-rPPG [30] | 15.32 | 19.58 | 0.29 | — | — | —
HSRD [33] | 13.65 | 17.29 | 0.30 | — | — | —
DG-rPPG [34] | 12.97 | 17.58 | 0.31 | — | — | —
MTTS-CAN [74] | 13.96 | 18.31 | 0.30 | 5.22 | 6.68 | 0.37
BigSmall [75] | 13.13 | 18.02 | 0.32 | 5.26 | 6.72 | 0.37
MultiPhys [76] | 12.17 | 18.80 | 0.45 | 5.53 | 7.02 | 0.38
PhysMLE [22] | 12.03 | 17.11 | 0.46 | 5.12 | 7.03 | 0.38
ResNet3D [67] | 14.27 | 19.02 | 0.29 | 5.63 | 6.58 | 0.14
ViViT [68] | 14.92 | 19.61 | 0.28 | 5.33 | 6.10 | 0.16
VDMoE [29] | 10.32 | 15.37 | 0.53 | 4.98 | 6.53 | 0.45
VideoMamba [69] | 14.02 | 18.65 | 0.42 | 4.53 | 6.02 | 0.36
CogMamba | 10.04 | 14.20 | 0.56 | 4.18 | 5.39 | 0.51
Table 5. Ablation study on the eDream dataset: cognitive load, HR, and RR estimation.
Model Variant | Cognitive Load Acc (%) | Cognitive Load F1 (%) | HR MAE ↓ | HR p | RR MAE ↓ | RR p
ResNet3D [67] | 77.18 | 65.19 | 10.50 | 0.65 | 2.91 | 0.37
ViViT [68] | 78.46 | 66.97 | 10.21 | 0.65 | 2.82 | 0.38
CogMamba w/o X_stmap | 82.56 | 73.65 | 15.12 | 0.41 | 6.41 | 0.12
CogMamba w/o X_facial | 81.23 | 71.23 | 9.11 | 0.65 | 2.61 | 0.35
CogMamba w/o X_leye | 70.54 | 62.01 | 9.01 | 0.66 | 2.34 | 0.38
CogMamba w/o X_reye | 70.76 | 62.23 | 9.02 | 0.65 | 2.35 | 0.37
CogMamba w/o X_mouth | 75.54 | 70.01 | 8.99 | 0.66 | 2.27 | 0.39
CogMamba (Full) | 83.56 | 74.48 | 8.97 | 0.66 | 2.20 | 0.39
Table 6. Ablation study on the MCDD dataset: cognitive load, HR, and RR estimation.
Model Variant | Cognitive Load Acc (%) | Cognitive Load F1 (%) | HR MAE ↓ | HR p | RR MAE ↓ | RR p
ResNet3D [67] | 72.04 | 61.10 | 14.27 | 0.29 | 5.63 | 0.14
ViViT [68] | 71.76 | 56.57 | 14.92 | 0.28 | 5.33 | 0.16
CogMamba w/o X_stmap | 80.23 | 70.62 | 16.22 | 0.12 | 8.31 | 0.08
CogMamba w/o X_facial | 78.90 | 70.00 | 10.11 | 0.54 | 4.61 | 0.50
CogMamba w/o X_leye | 69.56 | 60.12 | 10.11 | 0.55 | 4.24 | 0.50
CogMamba w/o X_reye | 69.70 | 60.28 | 10.12 | 0.55 | 4.25 | 0.50
CogMamba w/o X_mouth | 75.50 | 66.91 | 10.05 | 0.56 | 4.27 | 0.51
CogMamba (Full) | 81.22 | 71.62 | 10.04 | 0.56 | 4.18 | 0.51
Table 7. Cross-dataset experiment of training on the eDream and testing on the MCDD dataset.
Method | Cognitive Load Acc (%) | Cognitive Load F1 (%) | HR MAE ↓ | HR p | RR MAE ↓ | RR p
ResNet3D [67] | 57.03 | 52.63 | 17.07 | 0.424 | 4.96 | 0.236
ViViT [68] | 58.92 | 51.88 | 16.26 | 0.420 | 4.31 | 0.226
VDMoE [29] | 65.97 | 60.17 | 11.92 | 0.456 | 3.29 | 0.238
HSRD [33] | — | — | 13.65 | 0.455 | 3.79 | 0.255
DG-rPPG [34] | — | — | 13.27 | 0.455 | 3.67 | 0.264
MultiPhys [76] | — | — | 16.26 | 0.420 | 4.31 | 0.226
PhysMLE [22] | — | — | 15.48 | 0.434 | 4.18 | 0.233
CogMamba | 68.49 | 62.14 | 12.66 | 0.412 | 3.88 | 0.243
Table 8. Cross-dataset experiment of training on the MCDD and testing on the eDream dataset.
Method | Cognitive Load Acc (%) | Cognitive Load F1 (%) | HR MAE ↓ | HR p | RR MAE ↓ | RR p
ResNet3D [67] | 55.43 | 52.77 | 18.55 | 0.204 | 7.32 | 0.098
ViViT [68] | 56.23 | 59.60 | 19.40 | 0.197 | 6.93 | 0.113
VDMoE [29] | 65.97 | 58.17 | 17.75 | 0.210 | 5.20 | 0.210
HSRD [33] | — | — | 13.42 | 0.371 | 6.47 | 0.315
DG-rPPG [34] | — | — | 13.05 | 0.392 | 5.43 | 0.357
MultiPhys [76] | — | — | 15.82 | 0.315 | 7.19 | 0.266
PhysMLE [22] | — | — | 15.64 | 0.322 | 6.66 | 0.266
CogMamba | 66.85 | 60.13 | 16.86 | 0.217 | 5.72 | 0.203
Table 9. Computational cost comparison of different models.
Model | Param (M) | FLOPs (G) | Inference Time (ms)
ResNet3D [67] | 33.37 | 40.70 | 5.57
ViViT [68] | 87.35 | 283.06 | 29.81
VideoMamba [69] | 27.65 | 28.33 | 4.33
VDMoE [29] | 7.34 | 4.05 | 1.91
PhysMLE [22] | 24.82 | 34.57 | 4.96
HSRD [33] | 12.16 | 4.94 | 1.99
CogMamba | 5.95 | 3.17 | 1.82
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
