Article

Deep-Learning-Based Human Activity Recognition: Eye-Tracking and Video Data for Mental Fatigue Assessment

Saint Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3789; https://doi.org/10.3390/electronics14193789
Submission received: 6 August 2025 / Revised: 19 September 2025 / Accepted: 23 September 2025 / Published: 24 September 2025
(This article belongs to the Special Issue Deep Learning Applications on Human Activity Recognition)

Abstract

This study addresses mental fatigue as a critical state arising from prolonged human activity and positions its detection as a valuable task within the broader scope of human activity recognition using deep learning. This work compares two models for mental fatigue detection: a model that uses eye-tracking data for fatigue prediction and a vision-based model that relies on vital signs and human activity indicators extracted from facial video using deep learning and computer vision techniques. The eye-tracking model, based on the TabNet architecture, achieved 82% accuracy, while the vision-based model, a Random Forest classifier fed with features estimated using deep learning and computer vision, reached 78% accuracy. A correlation analysis revealed strong alignment between the two models’ predictions, with 21 out of 27 sessions showing significant positive correlations on the collected dataset. A further comparison with an earlier-developed vision-based model trained on another dataset supported the generalizability of using physiological indicators extracted from facial video for fatigue estimation. These findings highlight the potential of the vision-based model as a practical alternative to sensor- and special-device-based systems, especially in settings where non-intrusiveness and scalability are critical.

1. Introduction

Mental fatigue detection plays a critical role within the broader domain of human activity recognition as it reflects internal cognitive responses to prolonged or demanding activities such as driving, operating machinery, or desk-based work. As mental load increases, it can lead to productivity drops, impaired judgment, and ill effects on overall well-being [1]. With recent technological advancements, organizations are now better equipped to monitor cognitive states such as fatigue in real time, enabling more proactive strategies to support employee focus, engagement, and long-term resilience.
Traditional methods of fatigue measurement are either based on self-report or sensor-based physiological recordings such as heart rate and electroencephalography (EEG) [2]. While effective, these measures are either intrusive, require special equipment, or are not suitable for continuous monitoring across a range of environments. Hence, growing interest has been invested in non-intrusive, automatic fatigue detection systems that can be passively and economically run at scale based on behavioral and visual cues from facial expressions or other observable measures [3].
Patterns of eye movements have proven to be very informative for fatigue measurement because they directly reflect changes in attention and cognitive control. Stability of fixation, saccade dynamics, and pupil behavior are some of the parameters that have been associated with mental fatigue and are easily recorded using eye-tracking sensors and devices [4]. However, such systems generally depend on specialized hardware and are, therefore, less feasible for practical implementation. Recently, advances in computer vision and machine learning have enabled the estimation of physiological signals (e.g., heart rate and respiration rate) [5,6] from standard facial video, representing a promising path for scalable, contactless fatigue measurement.
In this study, we compare two models for fatigue detection: one is based on eye-tracking data and the other uses physiological indicators estimated from facial video. The first model employs statistical features derived from eye movement signals and utilizes a deep learning framework optimized for tabular data. The second model estimates fatigue from vision-based physiological indicators using deep learning and computer vision techniques and applies a classical machine learning model. By conducting a detailed correlation analysis between the predictions of these models on continuous data, we aim to investigate the following:
  • Cross-Validation of Methodological Robustness: By deploying the vision-based model in a novel collected dataset, we aim to assess the robustness and reproducibility of the earlier-developed vision-based model trained on another dataset collected under different experimental conditions. This is essential for verifying that the model is not overfitted to a specific protocol, subject group, or recording environment.
  • Functional Comparison to Advanced-Devices-Based Models: Our previously developed vision-based model has not been functionally compared to real-time, advanced-devices-based measures of fatigue. In this study, the availability of eye-tracking-derived fatigue predictions enables a functional comparison between two distinct modalities: eye movement patterns collected with an advanced device and vision-estimated vital signs and human activity indicators. This comparison allows us to test for convergent validity, i.e., whether different definitions and measurement techniques of fatigue produce consistent trends over time.
  • Validation of Contactless Alternatives in Practical Use Cases: In real-world applications, contactless monitoring is often preferred due to its scalability and minimal user burden. Proving that the vision-based model maintains predictive value in a new dataset, and aligns with eye-tracking model outputs, provides compelling evidence for its use as a viable alternative or supplement to advanced-devices-based systems.
Compared with sensor-based approaches such as EEG, photoplethysmography (PPG), or inertial measurement units (IMUs), our framework offers the advantages of non-intrusiveness and minimal setup requirements, making it well suited for settings where camera infrastructure is available (e.g., vehicle cabins, workstations, or digital devices). However, we recognize that wearable sensors, while often more intrusive and sometimes impractical for continuous long-term use, provide reliable physiological signals independent of visibility and illumination, which makes them valuable in environments where camera-based monitoring is impractical. Therefore, our contributions are as follows:
  • We develop and evaluate a fatigue detection model based on eye-tracking data using statistical features extracted from eye movement data;
  • We implement a vision-based model and assess its performance on a constrained subset of the data to test its robustness and generalizability;
  • We conduct a comprehensive correlation analysis to explore the convergence of predictions between the two models and validate their consistency.
We use the term eye-movement-based model since the dataset used in our study was collected using a specialized eye-tracking device equipped with two advanced, high-cost cameras designed to monitor eye movement with high precision. Our focus is to demonstrate the effectiveness of a low-cost, fully contactless alternative. Specifically, we present a vision-based model that relies solely on input from a standard camera—such as those commonly found on laptops. We refer to this as the physiological indicator model or vision-based model as it estimates vital signs and behavioral indicators using deep learning and computer vision techniques. Throughout the paper, we refer to the model developed in the current work as the vision-based model and to the model developed in our previous work [7] as the earlier vision-based model.
The paper is structured as follows: Section 2 describes several approaches proposed for the task of fatigue detection. Section 3 includes an overview of the earlier vision-based model, the overall research structure, and the datasets used. Section 4 outlines the experiments that we conducted to develop the proposed models and the correlation analysis implemented with the achieved results; it is followed by Section 5, where we discuss the results in detail along with the limitations we faced. Finally, Section 6 presents the conclusion of this research.

2. Literature Review

Extensive research has examined mental fatigue, addressing its classification, underlying neural mechanisms, and standardized assessment criteria. This work can be broadly categorized into five domains: active detection methods, behavioral feature assessments, physiological signal detection, biochemical marker analysis, and multimodal fusion approaches [8]. Our study focuses on fatigue detection using behavioral features and physiological signals, reflecting the types of features and data analyzed in this work. This section reviews methodologies and approaches that use sensor-derived data to investigate links between fatigue and vital signs such as respiratory rate, blood pressure, and heart rate variability (HRV) in addition to the contactless approaches that predominantly detect fatigue through facial feature estimation, including eye and mouth movements, and head posture. Advances in deep learning and computer vision have significantly improved the accuracy of these estimations, reducing concerns about discrepancies between estimated and directly measured values.

2.1. Fatigue Estimation via Sensors and Advanced-Device Data

Study [9] introduces a non-invasive intelligent cushion system developed to evaluate mental fatigue in construction equipment operators by continuously monitoring heart rate and respiratory signals in real time. Features in both the time and frequency domains were extracted from the recorded heart rate and respiration signals to train a Random Forest classification model. The objective was to investigate the relationship between physiological indicators and self-reported mental fatigue, as assessed by the NASA Task Load Index (NASA-TLX) [10]. The model demonstrated a classification accuracy of 92%, with results showing that the integration of heart rate and respiration features yielded superior performance in detecting and classifying mental fatigue compared to using either signal alone.
Another study [11] demonstrates the feasibility of using multimodal wearable sensor data, specifically such features as heart rate, heart rate variability, respiratory rate, energy expenditure, activity counts, and step count. These data are combined with machine learning for fatigue estimation. After imputing missing data using a recurrent neural network, both supervised and unsupervised approaches were evaluated. The best performance was achieved using a causal convolutional neural network with a Random Forest classifier (precision = 0.70; recall = 0.73). Vital signs contributed most to predicting mental fatigue, while both activity and physiological features were important for physical fatigue. These results support fatigue as a multimodal construct and offer a foundation for scalable, data-driven fatigue monitoring in daily life.
Article [12] investigates the relationship between physiological parameters and fatigue in labor employees. In a controlled laboratory setting, heart rate, respiratory rate interval, respiratory rate, and blood pressure were measured before and after induced fatigue. Using paired samples t-tests in SPSS (https://www.ibm.com/products/spss-statistics, accessed on 19 September 2025), the analysis revealed no significant changes in heart rate and respiratory rate interval, whereas both the respiratory rate and blood pressure showed significant changes, specifically, a decrease in respiration and an increase in blood pressure following fatigue.
Regarding the relationship between brain activity and mental fatigue, research [13] focuses on the detection of mental fatigue in construction site operators by means of wearable EEG sensors embedded in flexible headbands, combined with deep learning techniques. Mental fatigue levels were quantified according to the NASA-TLX scale as the reference standard. The raw EEG signals obtained were used for training and validating different deep learning models, including long short-term memory (LSTM) [14], bidirectional LSTM (Bi-LSTM) [15], and one-dimensional convolutional neural networks. Among these, Bi-LSTM performed best in the classification task, with 99.941% accuracy.
Recent advances in probabilistic machine learning have demonstrated significant potential for enhancing the accuracy of driver drowsiness detection. In one notable investigation [16], researchers developed a novel framework leveraging EEG and heart rate data, achieving a reported 100% accuracy in classifying states of wakefulness. Their methodology involved preprocessing EEG signals, extracting power spectral densities, and modeling their relationship with the heart rate via Support Vector Regression (SVR), followed by state classification using Bayesian Support Vector Classification (SVC). While these results are exceptional, the authors crucially note that an analysis of the predicted class probabilities suggests the influence of latent factors beyond the included physiological measures. This underscores a key limitation of relying solely on EEG and heart rate data while simultaneously highlighting the promise of probabilistic approaches for robust drowsiness detection.
Study [17] focused on aviation safety. Researchers employed a non-invasive brain–computer interface (BCI) to decode pilot mental states from electroencephalogram (EEG) data within a simulated flight environment. The study’s significant contribution lies in its move beyond a simple alert-versus-drowsy dichotomy to the detailed classification of five distinct drowsiness levels, using exclusively EEG signals. To accomplish this, the authors developed a sophisticated deep spatio-temporal convolutional bidirectional long short-term memory network (DSTCLN) model. When validated against the Karolinska Sleepiness Scale (KSS), the model demonstrated robust performance, achieving a grand-averaged accuracy of 87% for the two-state classification and a promising 69% for the more challenging five-level classification.
Another relevant indicator for fatigue detection is head movement. Therefore, the authors of [18] aimed to evaluate the feasibility of fatigue detection using head movement signals captured by a single-axis ± 1 g MEMS accelerometer (miniaturized sensor for measuring acceleration and motion). By comparing patterns of active and drowsy head movements, head nodding is examined as a behavioral indicator of fatigue. The findings indicate that fatigue onset can be reliably identified through analysis of these signals, with individual-level detection achievable using a 4% threshold based on the normalized fractal dimension.
Heart rate variability (HRV) has also been shown to be an effective physiological marker of mental fatigue caused by extended cognitive activities such as driving. Long-term time on task is typically associated with reduced parasympathetic activity, which is reflected in reduced high-frequency components (HF) and time-domain values such as rMSSD and pNN50 [19]. Many studies have also shown a repeated increase in low-frequency (LF) components and time-domain variables, particularly rMSSD [20]. Moreover, mental fatigue is most often associated with a reduced state of arousal consistent with high vagal activity, as evidenced by changes in HRV patterns [21]. Such findings support the use of HRV as an objective and non-invasive approach for monitoring mental fatigue across a range of real-world scenarios.
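For concreteness, the two time-domain markers cited above have standard definitions: rMSSD is the root mean square of successive RR-interval differences, and pNN50 is the percentage of successive differences exceeding 50 ms. The following minimal Python sketch computes them from an RR-interval series in milliseconds (the synthetic series shown is illustrative only, not data from any cited study).

```python
import numpy as np

def hrv_time_domain(rr_ms: np.ndarray) -> dict:
    """Time-domain HRV markers from RR intervals given in milliseconds."""
    diffs = np.diff(rr_ms)                      # successive RR differences
    rmssd = np.sqrt(np.mean(diffs ** 2))        # root mean square of differences
    pnn50 = 100.0 * np.count_nonzero(np.abs(diffs) > 50) / diffs.size
    return {"rMSSD_ms": rmssd, "pNN50_pct": pnn50}

# Illustrative synthetic RR series (ms)
rr = np.array([820.0, 810.0, 845.0, 790.0, 805.0, 860.0, 815.0])
print(hrv_time_domain(rr))
```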
Eye-related metrics have also gained increasing attention in the detection and analysis of fatigue. The existing literature indicates that increased cognitive load is commonly associated with physiological changes such as pupil dilation, increased blink frequency, and a decrease in mean relative fixation duration [22]. Similar findings were reported by [23], who documented a 91% increase in pupil diameter and a substantial 31.31% reduction in fixation time, alongside an approximate 40% decrease in saccade distance. Furthermore, the authors of [24] identified a significant positive correlation between mental fatigue and blink rate, confirming the relevance of ocular behavior as a sensitive indicator of cognitive workload and fatigue.

2.2. Fatigue Estimation via Features Estimated Using Computer Vision and Deep Learning

As mentioned before, contactless fatigue detection methods increasingly depend on the estimation of facial features such as eye behavior, mouth movements, and head posture. These estimations are made through deep learning and computer vision, which have proven highly effective at accurately extracting such features, enhancing the feasibility of deploying non-intrusive fatigue monitoring systems in real-world settings while ensuring both accuracy and user convenience.
The study presented in [25] introduces an efficient neural network model adapted to detect driver fatigue in real time. The proposed system integrates two core components: object detection and fatigue evaluation. A lightweight detection network is first employed to identify the eye and mouth states (specifically, whether they are open or closed) directly from video input. These observations are then processed by the EYE–MOUTH (EM) fatigue detection module, which encodes the visual data and calculates two key indicators: the Percentage of Eyelid Closure over the Pupil (PERCLOS) and the Frequency of Open Mouth (FOM). To determine the driver’s fatigue status, a multi-feature fusion decision algorithm is applied. The model demonstrated strong performance, achieving an accuracy of 98.30%.
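As an illustration of how such indicators are typically computed, the sketch below derives PERCLOS and FOM from per-frame binary eye and mouth states, such as those produced by the detection network; the exact window lengths, thresholds, and fusion rule used in [25] are not specified here, so those details are assumptions.

```python
import numpy as np

def perclos(eye_closed: np.ndarray) -> float:
    """PERCLOS: fraction of frames in a window in which the eyes are closed."""
    return float(np.mean(eye_closed))

def fom(mouth_open: np.ndarray, fps: float) -> float:
    """FOM: open-mouth events per minute, counted as closed-to-open transitions."""
    transitions = np.count_nonzero(np.diff(mouth_open.astype(int)) == 1)
    return transitions / (mouth_open.size / fps / 60.0)

# Illustrative per-frame states over a 30-s clip at 25 fps
rng = np.random.default_rng(0)
eyes = rng.random(750) < 0.2    # True = eyes closed in that frame
mouth = rng.random(750) < 0.1   # True = mouth open in that frame
print(perclos(eyes), fom(mouth, fps=25.0))
```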
In [26], the authors developed a comprehensive system for detecting driver drowsiness, employing a multi-model deep learning framework specifically designed to identify fatigue-related indicators. The system leverages four deep neural networks: AlexNet [27], VGG-FaceNet [28], FlowImageNet, and ResNet to analyze videos of drivers. It extracts and processes features across four domains: facial expressions, hand gestures, head movements, and behavioral cues. AlexNet is used to handle variability in lighting and environmental conditions, while VGG-FaceNet captures facial attributes, including gender and ethnicity. FlowImageNet focuses on interpreting behavioral patterns and head movements, and ResNet is used to detect hand gestures. The extracted features are categorized into four fatigue states: alertness, drowsiness with eye blinking, yawning, and nodding. An ensemble learning mechanism integrates the outputs of these models, with the final classification performed using a SoftMax layer. The system achieved an overall accuracy of 85%, demonstrating its potential for real-time driver fatigue monitoring.
Paper [29] presents a driver fatigue detection system that integrates a Residual Channel Attention Network (RCAN) [30] with 3D head pose estimation. RetinaFace is first used for face localization and landmark detection, followed by RCAN for classifying eye and mouth states. The network’s channel attention mechanism enhances feature extraction, leading to high classification accuracies—98.962% for eye states and 98.561% for mouth states—surpassing conventional CNNs (convolutional neural networks). Fatigue is assessed using the Percentage of Eyelid Closure over Time (PERCLOS) and the Percentage of Mouth Opening (POM). To complement behavioral cues, head pose is estimated via the Perspective-n-Point (PnP) method [31], with an over-angle metric used to detect abnormal head deflections. Evaluation across four datasets confirms the system’s effectiveness for robust driver fatigue monitoring.
In [32], the authors employed a pre-trained model based on a histogram of oriented gradients (HOG) [33] combined with a linear support vector machine (SVM) to detect key facial landmarks, including the eyes, nose, and mouth. From these features, three key ratios were computed: the eye aspect ratio (EAR), mouth opening ratio (MOR), and nose length ratio (NLR). These metrics served as indicators for drowsiness detection within video frames. The system initially applied adaptive thresholding to classify blinking, yawning, and head tilting behaviors, followed by the use of several machine learning classifiers to distinguish between drowsy and alert states. Among the tested models, the SVM achieved the highest accuracy, reaching 96.4%, demonstrating its effectiveness for real-time driver drowsiness detection.
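For reference, the eye aspect ratio in such landmark-based systems is commonly defined over the six eye landmarks p1, ..., p6 (the two corners, two upper-lid points, and two lower-lid points); whether [32] uses exactly this formulation is an assumption based on standard practice:

```latex
\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2 \, \lVert p_1 - p_4 \rVert}
```

The ratio stays roughly constant while the eye is open and collapses toward zero during a blink, which is why thresholding it over consecutive frames separates blinking from sustained closure.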
Finally, some researchers have explored the integration of features collected from sensors and videos together. An example is the approach proposed by [34], which combines three machine learning models with multimodal data fusion to classify mental fatigue levels. Data were collected during simulated excavation tasks using a combination of EEG, electrodermal activity (EDA), and facial video recordings. EEG signals were captured via a wearable headband, EDA data were gathered using an E4 wrist-worn device, and facial features were recorded through a camera mounted on the excavator’s front panel. In addition, participants provided subjective evaluations of their mental fatigue through a questionnaire. Among the models tested, the decision tree classifier integrating fused sensor data outperformed the others, reaching an accuracy of 96.2%. These results highlight the effectiveness of multimodal sensor fusion for enhancing the reliability of mental fatigue detection and support its potential application in real-time monitoring systems.
In conclusion, sensor-based approaches have been important for gaining insights into fatigue, with clear relationships established between physiological indicators and fatigue states. These methods have paved the way for the development of more proficient and intelligent fatigue-detection systems, although their practical use is limited by the requirement of continuous body contact or sensors attached to the skin or head. There is also a strong trend in contactless fatigue detection that focuses mainly on facial features, especially the eyes and mouth. These techniques can be highly accurate and effective; however, their ability may be hindered by a number of factors, such as a person wearing a mask or sunglasses, which may prevent a real-world implementation of the system.
Our previous work sought to address these challenges by developing a vision-based fatigue detection model that estimates vital signs and human activity indicators using deep learning and computer vision techniques [7]. To reduce the high computational cost associated with deep convolutional and transformer-based architectures, we subsequently applied feature analysis to identify the most critical predictors, thereby improving efficiency without compromising performance [35]. These efforts demonstrated the feasibility of contactless, vision-based fatigue monitoring and highlighted its potential advantages over sensor-dependent systems.
Nevertheless, important gaps remain. Specifically, existing vision-based approaches, including our own prior work, have not been extensively validated on datasets collected under different experimental conditions, such as the differences in the type and intensity of tasks performed, the duration of the tasks, and the specific metrics used to measure fatigue levels, leaving questions about their robustness and generalizability. Moreover, there is a lack of studies that conduct functional comparisons between vision-derived physiological indicators and device-based fatigue measures such as eye-tracking signals. This lack of cross-modal validation limits confidence in the applicability of vision-based methods as practical alternatives to advanced devices.
To address this gap, we present a study that investigates fatigue detection during prolonged cognitive tasks using two distinct strategies: (i) a device-based model that uses statistical features from eye movement data and (ii) a vision-based model that estimates physiological indicators from standard video to detect mental fatigue. By evaluating both models on the same dataset and analyzing the consistency of their predictions, we aim to provide new insights into the robustness, convergence, and practical feasibility of contactless fatigue detection.

3. Methodology

In this section, we present a detailed description of our methodology, beginning with an overview of the proposed structure followed by a summary of the earlier vision-based model that forms the basis of our current evaluation. This includes an outline of the deep learning models employed for estimating vital signs and human activity indicators. Additionally, we provide information on both the original dataset used in the development of the initial model and the collected dataset introduced in this study to assess its robustness and generalizability.

3.1. Proposed Models for Fatigue Assessment

The overall structure of this research involves the use of two distinct models for fatigue detection: an eye-movement-based model and a vision-based model that is fed with vital signs and human activity indicators estimated using deep learning and computer vision techniques (see Figure 1). The eye-movement-based model is based on the TabNet architecture fed with statistical features of the eye movement data, while the developed vision-based model is based on the Random Forest architecture and uses the physiological indicators estimated from videos to detect the mental fatigue level. The primary objective is to evaluate the consistency and reliability of the contactless, video-driven model by comparing it against a more conventional model that uses features collected through an advanced device. To check the validity of outcomes and eliminate overfitting, both models will be applied to an independent test set not included in the training. This assessment will comprise a detailed analysis of the correlation between predictions made by the two different models and an attempt to identify whether both models capture similar patterns of fatigue. Such an evaluation is necessary to determine the functional equivalence between measurements collected from special devices and features estimated from visual information, as well as to verify the applicability of vision-based models in real settings.
To further validate the reliability of our earlier vision-based model, we also conduct a correlation analysis between the predictions generated by the eye-tracking data model and those produced by our earlier vision-based model trained on another dataset (see Section 3.2.1). Notably, the earlier-developed vision model has not been exposed to data from the participants in the collected dataset, allowing us to assess its generalization capability across different populations. While a strong correlation is not necessarily expected, given variations in data sources and individual differences, the presence of a meaningful correlation would nonetheless provide supporting evidence for the applicability of physiological indicators estimated through deep learning in fatigue detection. This analysis serves as an additional step toward establishing the robustness and relevance of the vision-based models.

3.2. Earlier-Developed Vision-Based Method

We explain the details of the earlier vision-based model that is the core of our current work by outlining the deep learning models that are used to extract vital signs and human activity indicators, which serve as features for fatigue detection.

3.2.1. Mental Fatigue Estimation Method Using Vital Signs and Human Activity

Our previously proposed method [7] starts with extracting vital signs and human activity indicators from video recordings of operators using a set of dedicated deep learning models. These indicators include heart rate, respiratory rate, blood pressure, oxygen saturation, eye closure ratio, and head pose, among others. They are computed at one-minute intervals to enable continuous monitoring within the fatigue detection framework. The extracted features are then fed into a classification model designed to assess fatigue levels, with mental performance measured via the Landolt rings test, serving as the reference for the cognitive state. The underlying assumption is that mental performance declines as fatigue increases. To identify the most effective predictive model, we evaluated several machine learning algorithms, including Support Vector Classifier (SVC) [36], logistic regression [37], Multi-Layer Perceptron (MLP) [38], decision tree [39], XGBoost [40], and Random Forest [41]. Among these, the Random Forest model demonstrated the best performance, achieving an F1-score of 0.947 in predicting fatigue based on vital signs and human activity extracted from video data.
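As a schematic of this model-selection step, the sketch below compares several of the listed classifiers using cross-validated F1-scores; the variable names and the synthetic feature table are placeholders, since the full pipeline is documented in [7].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the minute-level indicator table (heart rate,
# respiratory rate, blood pressure, SpO2, eye closure ratio, head pose, ...)
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 8))
y = rng.integers(0, 2, size=240)  # binary fatigue label from the Landolt threshold

models = {
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(),
}
for name, model in models.items():
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: mean F1 = {f1:.3f}")
```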
To make our model more efficient, we employed feature importance techniques to reveal the most effective physiological indicators of fatigue [35]. The results showed that heart rate, oxygen saturation, blood pressure (systolic and diastolic), and average pitch consistently emerged as the most critical indicators. This not only enhanced our model’s performance but also provided important insights into the most crucial factors behind fatigue. In addition, we went beyond conventional machine learning models such as Random Forest and used a more recent architecture, the Tabular Transformer. This allowed for stronger generalization and better processing of structured data, achieving an accuracy of 89%. The findings indicate the promise of more accurate and trustworthy detection of mental fatigue in real-world, practical applications. The earlier vision-based model is shown in Figure 2.

3.2.2. Deep Learning Models for Physiological Indicator Estimation

This subsection summarizes the models used to estimate vital signs and human activity indicators relevant to fatigue detection, highlighting their architectures, performance, and limitations.
  • Respiratory Rate and Breathing Characteristics: The respiratory rate model combines OpenPose for chest keypoint detection with SelFlow [42] for displacement analysis. Signal processing techniques further refine the output, achieving a mean absolute error (MAE) of 1.5 breaths per minute. However, the model is limited in dynamic settings, such as during movement.
  • Heart Rate: The heart rate estimation model employs facial region extraction followed by processing with a Vision Transformer [43]. A layered block structure calculates heart rate using a weighted average. While effective overall, the model underperforms for extreme heart rates due to limited training data coverage.
  • Blood Pressure: Blood pressure estimation begins by identifying cheek regions in video frames. Features are extracted using EfficientNet and fed into LSTM layers. The model reports MAEs of 11.8 mmHg (systolic) and 10.7 mmHg (diastolic), with accuracies of 89.5% and 86.2%. Skin tone diversity in the dataset remains a limitation.
  • Oxygen Saturation: The SpO2 estimation model uses 3DDFA_V2 [44] for face detection, VGG19 for feature extraction, and XGBoost for regression. It achieves MAEs of 1.17% and 0.84% on two datasets. However, it lacks sufficient low-SpO2 samples, affecting performance for certain clinical cases.
  • Head Pose: Head pose is estimated via face detection using YOLO Tiny [45], followed by 3D face reconstruction and landmark tracking to compute Euler angles. The model is effective but constrained to head angles under 70°, and it runs more slowly than the other approaches.
  • Eye and Mouth States: Eye state detection is based on facial inputs from FaceBoxes, while the mouth state is classified using a modified MobileNet, achieving 95.2% accuracy. Despite high performance, the use of a private dataset for training may limit generalizability.
We would like to note that these articles have already been reviewed and cited in our prior work [7], where their details are provided comprehensively.

3.3. Proposed Methods

In this section, we provide an overview of the eye-tracking-based method and the vision-based method by outlining the features used for training and the technical ground behind choosing specific models for the fatigue assessment task.

3.3.1. Eye-Tracking-Based Method

The dataset used in this study, as will be detailed in a later section, contains continuous eye-movement-tracking data collected from participants during computer-based tasks. To assess fatigue levels at a per-minute resolution, relevant features are extracted from the provided data.
The feature extraction procedure focused on generating meaningful indicators from the x and y coordinate time series representing eye movements for each activity. Seven of the most informative statistical features were computed at one-minute intervals: mean, standard deviation, minimum, maximum, 25th percentile, median (50th percentile), and 75th percentile. This was implemented using standard Python (3.8) and the NumPy library (1.22.0). The resulting feature set provides a structured representation of eye movement dynamics, supporting the evaluation of cognitive states associated with fatigue.
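A minimal sketch of this per-minute feature extraction is shown below; the eye tracker’s sampling rate is not stated in this section, so the 60 Hz value is an assumption for illustration.

```python
import numpy as np

def minute_features(coords: np.ndarray) -> np.ndarray:
    """Seven statistics for one minute of a gaze-coordinate series."""
    return np.array([
        coords.mean(),
        coords.std(),
        coords.min(),
        coords.max(),
        np.percentile(coords, 25),
        np.median(coords),
        np.percentile(coords, 75),
    ])

def extract_session_features(x: np.ndarray, y: np.ndarray, hz: int = 60) -> np.ndarray:
    """Split x/y gaze traces into one-minute windows and stack their statistics.

    hz is the assumed sampling rate of the eye tracker (not given in the text).
    """
    win = 60 * hz
    n = min(x.size, y.size) // win
    rows = [np.concatenate([minute_features(x[i * win:(i + 1) * win]),
                            minute_features(y[i * win:(i + 1) * win])])
            for i in range(n)]
    return np.vstack(rows)  # shape: (minutes, 14)
```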
We employ TabNet as the predictive model. TabNet is a deep learning architecture specifically designed for tabular data. It integrates principles from both decision trees and neural networks, enabling it to learn complex feature interactions while preserving a degree of interpretability. Its key advantages include the following:
  • Feature-wise attention, allowing the model to focus on the most relevant indicators of fatigue;
  • Built-in interpretability, offering insights into which features influence predictions;
  • Strong performance on small datasets, which is ideal given our limited sample size;
  • Improved generalization, helping the model remain robust across different individuals and sessions.
These reasons made it a strong candidate for fatigue estimation in this context. We used the PyTorch (2.8.0) implementation of TabNet proposed by [46] (pytorch-tabnet 4.1.0). Moreover, we implemented a six-fold cross-validation approach to ensure robust evaluation and generalizability of the model across different individuals, as explained in detail in a later section. Figure 3 shows the overall scheme of this method.
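A hedged sketch of the training call with pytorch-tabnet is given below; the default architecture settings and the synthetic stand-in arrays are assumptions, since the exact hyperparameters are not listed here.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Synthetic stand-in: 14 columns = 7 statistics x 2 gaze axes, binary labels
rng = np.random.default_rng(42)
X_train = rng.normal(size=(300, 14)).astype(np.float32)
y_train = rng.integers(0, 2, size=300)
X_valid = rng.normal(size=(60, 14)).astype(np.float32)
y_valid = rng.integers(0, 2, size=60)

clf = TabNetClassifier(seed=42)  # default architecture; paper's settings not listed
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["accuracy"],
    max_epochs=100,
    patience=20,
)
preds = clf.predict(X_valid)
```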

3.3.2. New Vision-Based Method

For this phase of the study, we build upon our earlier vision-based model, incorporating enhancements guided by prior feature importance analysis. In particular, we estimate key physiological indicators—heart rate, blood pressure, oxygen saturation, and average head pitch—at one-minute intervals from the newly collected video dataset. These minute-level estimates are then used as input features for fatigue prediction.
To model the relationship between these physiological indicators and fatigue levels, we employ a Random Forest classifier. The decision to use Random Forest, rather than the Tabular Transformer as in our latest enhancement, is motivated by several practical and methodological considerations:
  • Random Forests perform well even with relatively small datasets, which is advantageous given the limited size of our current annotated data.
  • The ensemble nature of Random Forest reduces the risk of overfitting, particularly important when dealing with noisy or estimated physiological features.
  • Compared to transformer-based models, Random Forests are faster to train and incur lower computational cost.
In addition, Random Forest achieved the highest accuracy in the fatigue detection model in our earlier vision-based model [7] before the enhancement made with feature importance analysis [35]. Figure 4 summarizes the process of the new vision-based method. The details of training and testing are included in the next section.
To assess the model’s generalizability and its alignment with the eye-tracking-based predictions, we trained it on a limited subset of the collected dataset to increase the challenge and test the model’s ability to distinguish fatigue levels under constrained conditions.

3.4. Datasets

To introduce the datasets used, we begin by discussing the dataset used in our previous work to train the earlier vision-based fatigue detection model. Although this dataset is not directly employed in the current experiments, its inclusion is essential for understanding the foundation of the model being evaluated. The dataset offers synchronized video recordings alongside cognitive performance scores, which enabled the development of a contactless fatigue detection model based on estimated physiological indicators. The second dataset, used for validation, includes eye-tracking data collected by a specialized device from a different participant group and experimental setting. We aim to assess the reliability and generalizability of our model. The correlation analysis between the models’ outputs serves as an important step in evaluating whether the deep-learning-based physiological estimations capture fatigue-related patterns consistent with direct eye-tracking measurements.

3.4.1. OperatorEYEVP Dataset

In our previous work [7], we employed the OperatorEYEVP dataset, originally introduced in [47], to develop a contactless fatigue detection system. This dataset comprises video recordings from ten participants performing a series of cognitive and routine tasks at three different times of day over a period of 8–10 days. Each daily session began with a sleep quality survey, followed by the VAS-F questionnaire, a choice reaction time (CRT) task, reading a scientific passage, completing the Landolt rings correction test, playing Tetris, and a second CRT task. The average session lasted approximately one hour.
To assess fatigue, we used mental performance scores derived from the Landolt test, which reliably reflects cognitive and attentional decline associated with fatigue. Based on the experimental findings reported in [7], a performance threshold was defined: values below it indicated a fatigued state, while values above represented a non-fatigued condition.

3.4.2. Collected Dataset

The dataset for this research consists of recordings from 17 individual participants over 27 sessions, each of which lasted roughly three hours. These 17 participants are primarily male operators. This relatively homogeneous group was available for data collection sessions under controlled conditions. While the sample size may appear modest, it aligns with many studies on mental fatigue detection, which often use 10–20 participants [9,12,13,24,34]. Additionally, our previous work applied the same modeling approach to a different dataset containing both male and female participants and achieved good performance, suggesting that the method generalizes beyond the current sample. Moreover, our goal in this study was not only to evaluate performance on a new dataset but also to validate the approach by correlating predictions from our vision-based model with those obtained from the eye-tracking device. This design provides an additional layer of cross-modal validation and supports the robustness of our findings despite the relatively small sample size.
Every participant first performed a Landolt rings test to establish a baseline for mental fatigue. Participants then underwent two rounds of mental tasks specifically designed to challenge their attention and cognitive processing abilities. Each round lasted about 90 min, with a brief break in between that enabled the participants to recover their mental concentration and performance. The break was included to help counteract cognitive fatigue and maximize the quality of the data gathered. During the cognitive task sessions, the number of errors made by each participant was recorded. This measure serves as an indicator of cognitive functioning and is used to gauge the participant’s state of mental fatigue. A low error rate typically indicates sustained attention and good cognitive performance, whereas a high error rate over time can indicate a progressive reduction in attentional capacity, most probably due to mental fatigue. This relationship between task performance and cognitive state is the foundation of the ground-truth labeling employed in the current study.

4. Experiments and Results

This section presents the experimental procedures and results in a stepwise manner to clarify how each part of the analysis builds on the previous one. First, we evaluate the eye-movement-based model using statistical features from the x and y coordinates of eye-tracking data to establish a device-based reference for mental fatigue detection. Next, we apply our vision-based model to the same sessions, using physiological indicators estimated from standard video to test its performance under contactless conditions. We then conduct a correlation analysis to directly compare the predictions of the two models and to assess their cross-modal alignment over time with a comparison of the correlation analysis between the predictions made by the eye-movement-based model and our original vision-based model trained using a different dataset. Finally, we discuss the combined findings and highlight how these results support the feasibility of scalable, non-intrusive fatigue monitoring.

4.1. Fatigue Detection Using Eye Movement Characteristics

As mentioned before, the current dataset captures ocular-related parameters continuously over the course of each session, specifically including the x and y coordinates of eye movement tracking. The model adopted for fatigue detection builds upon our previous work in [48], where statistical features were derived from the eye movement data and employed within a machine learning framework to estimate fatigue levels.
The error patterns observed across the initial and subsequent phases of the task indicate that participants tend to display greater alertness and reduced signs of fatigue during the early minutes of the first round. In contrast, by the end of the second round, a marked increase in fatigue becomes evident, often accompanied by observable behavioral indicators of tiredness. Based on this trend, we selected these specific intervals as representative segments for training a model aimed at predicting fatigue using eye-movement-tracking data.
We labeled the first 15 min of the first round as low fatigue and the last 15 min of the second round as high fatigue. We implemented a six-fold cross-validation approach, where each fold involved training the model on data from 14 participants while testing it on the remaining three. In the final fold, the model was trained using data from 15 participants and tested on the last two.
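A sketch of this participant-wise fold construction is as follows (the participant IDs and their ordering are hypothetical):

```python
import numpy as np

participants = np.arange(17)                              # hypothetical participant IDs
folds = [participants[i:i + 3] for i in range(0, 17, 3)]  # 5 folds of 3, 1 fold of 2

for k, test_ids in enumerate(folds):
    train_ids = np.setdiff1d(participants, test_ids)
    # Train on sessions from train_ids, evaluate on sessions from test_ids.
    print(f"fold {k}: {len(train_ids)} train / {len(test_ids)} test participants")
```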
For this part of our experiments, we chose a TabNet-based model for fatigue detection. TabNet is a deep learning model tailored for tabular data, making it well suited for this experiment involving statistical features from eye-tracking data.
The TabNet model achieved a mean accuracy of 81% across the cross-validation folds. In addition, we tested the same model with a random split of the dataset, using 80% of the samples for training and 20% for testing, and achieved an accuracy of 82%. Such stability is a critical factor in our task, as it indicates that TabNet is less susceptible to fluctuations in performance caused by variations in the dataset, making it a robust choice for predicting fatigue from eye-movement-tracking data.
After evaluating the performance of the eye-tracking model, we next applied the vision-based model to the same sessions to examine whether physiological indicators extracted from facial video can achieve comparable fatigue detection. This step allows us to directly compare contact-based and contactless approaches within a consistent experimental framework.

4.2. Vision-Based Fatigue Detection Using Physiological Indicators

Now, we move forward to our second part of the experiments. The objective behind adopting the earlier vision-based model in the current study was to evaluate the generalizability, cross-context validity, and functional equivalence of a previously developed fatigue estimation framework in a new data environment.
In the training of this model, we selected a limited subset of the dataset to enhance the challenge for the model, thereby assessing its generalizability and alignment with predictions derived from the eye-tracking model. Specifically, the training was conducted using data from 11 participants, focusing on the initial ten minutes of the first round, which were characterized by low fatigue, and the final ten minutes of the second round, during which fatigue was more evident.
We employed the same methodology previously established, utilizing the Random Forest model that demonstrated superior performance despite the inherent individual variability and differing health statuses of the participants, which could potentially influence the generalizability and robustness of the model. Notably, we opted to train on a duration of ten minutes, aiming to minimize the overlap of mutual knowledge between the two models while still maintaining satisfactory performance despite the constraints imposed by limited data availability. The best results were obtained with the hyperparameters n_estimators = 50 and criterion = ‘entropy’, yielding an accuracy of 78%, a precision of 80%, a recall of 72%, and an F1-score of 76%.
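A minimal sketch of this training step with the reported hyperparameters is shown below; the feature matrix is a synthetic stand-in for the minute-level estimates, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in: 5 indicator columns (heart rate, systolic BP, diastolic
# BP, SpO2, average pitch), one row per minute; labels 0 = low fatigue (first
# 10 min of round 1), 1 = high fatigue (last 10 min of round 2).
rng = np.random.default_rng(7)
X = rng.normal(size=(220, 5))
y = rng.integers(0, 2, size=220)
X_train, X_test = X[:176], X[176:]
y_train, y_test = y[:176], y[176:]

rf = RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```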
Having obtained predictions from both the eye-tracking and vision-based models, the next step was to perform a correlation analysis to assess how closely the two approaches converge over time.

4.3. Correlation Analysis for Validity Check

To assess the construct validity and functional consistency of both models, in addition to the earlier vision-based model that was trained on a different dataset, we conducted a session-wise correlation analysis between their predictions on unseen data (minute-level fatigue scores over 3-h continuous sessions). We applied a 5-min moving average to the three signals for noise reduction and trend identification, which is useful in time-series analysis when understanding how two signals relate over time is needed.
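A sketch of this smoothing-plus-correlation step for one session is given below; synthetic minute-level scores stand in for the two models’ outputs.

```python
import numpy as np
from scipy.stats import pearsonr

def moving_average(signal: np.ndarray, window: int = 5) -> np.ndarray:
    """5-min moving average ('valid' mode drops the partially covered edges)."""
    return np.convolve(signal, np.ones(window) / window, mode="valid")

# Synthetic minute-level fatigue scores for a 3-h session (180 min)
rng = np.random.default_rng(1)
trend = np.linspace(0.0, 1.0, 180)
eye_scores = trend + rng.normal(0.0, 0.2, 180)
vision_scores = trend + rng.normal(0.0, 0.2, 180)

r, p = pearsonr(moving_average(eye_scores), moving_average(vision_scores))
print(f"Pearson r = {r:.2f}, p = {p:.4f}")  # a session counts as aligned if p < 0.05
```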
Figure 5 presents the correlation analysis between the predictions of the eye-movement-based model (Section 4.1) and the model trained on physiological indicators (Section 4.2). The results indicate that 21 out of 27 sessions demonstrated a statistically significant positive correlation (p < 0.05). In this figure, the x-axis represents Pearson’s correlation coefficients [49], while the y-axis denotes the number of sessions. Among the 21 sessions exhibiting positive correlation, approximately 8 sessions displayed weak correlation, 10 sessions exhibited moderate correlation, and 3 sessions were characterized by strong correlation. Additionally, two sessions revealed a negative correlation, and four sessions did not show any discernible relationship, which are not represented in the figure. This suggests that the models are capturing converging fatigue-related patterns, despite relying on fundamentally different input modalities and features.
Whereas Figure 5 highlights the correlation between the eye-tracking model and the new vision-based model, Figure 6 extends this comparison to the earlier vision-based model trained on a separate dataset (Section 3.2.1). These figures demonstrate the degree to which fatigue-related patterns generalize across datasets and modalities. The analysis revealed that 19 out of 27 sessions showed a statistically significant positive correlation (p < 0.05). Among these, approximately 13 sessions demonstrated a weak correlation, while 6 exhibited a moderate correlation. In contrast, two sessions showed a negative correlation, and six sessions did not present any clear relationship.
These findings are significant in two respects:
  • Cross-modal agreement: It demonstrates that fatigue measured by patterns of eye movements is aligned with fatigue estimated from vital signs and human activity indicators, suggesting an underlying cognitive or physical state of fatigue common to both. This validates the vision-based physiological inference model.
  • Feasibility of low-cost contactless fatigue estimation: The results support the practical feasibility of deploying fatigue detection models based on standard camera inputs. Given the consistency with a well-established eye-movement-based model, the second model could serve as a scalable surrogate or complement to traditional methods in contexts where sensors or advanced devices are impractical, expensive, or intrusive.
Figure 7 provides representative examples of the previously discussed results. Each subplot displays three prediction curves: one from the earlier vision-based model (trained on an entirely separate dataset), one from the developed vision model (trained on a limited portion of the newly collected dataset), and one from the eye-movement-based model (also trained on the collected dataset). The blue line illustrates the fatigue predictions after applying a 5-min moving average, while the amber line represents a further smoothed version of the same signal to enhance visual clarity and presentation. To ensure that the correlation analysis reflects generalization rather than overlap, correlation coefficients were computed after excluding the time segments used for training the latter two models. As such, the visualized relationships reflect the models’ performance on unseen data.
A notable observation across the visualized results is the strong alignment between the predictions generated by the eye-movement-based model and the physiological indicator model trained on the same dataset. This alignment is particularly evident in Figure 7a–c, where fluctuations in predicted fatigue levels, especially the presence of peaks (high fatigue) and valleys (low fatigue), occur at approximately the same time points across both models. Although the magnitude of predicted fatigue may differ slightly, the temporal synchronization of changes suggests that both models are capturing similar underlying patterns of cognitive fatigue. This finding supports the earlier correlation analysis results, highlighting that fatigue inferred from ocular dynamics detected by special device is consistent with that estimated from vital signs and human activity indicators.
Furthermore, while the earlier vision-based model (trained on an entirely different dataset) often exhibits a weaker correlation with the eye-tracking-based predictions, the same alignment of peaks (high fatigue) and valleys (low fatigue) between the two signals is still recognizable in most plots. The reduction in correlation strength is likely attributable to differences in the experimental task demands between the datasets, as well as inter-individual variability in fatigue expression. These findings highlight the challenges of generalizing fatigue detection models across populations and contexts. Individual factors such as cognitive capacity and health status may all contribute to this variability, highlighting the importance of domain adaptation or fine-tuning when applying pre-trained models to new data environments.
Overall, these results demonstrate a stepwise validation of our approach from the baseline eye-tracking model, through the contactless vision-based model, to cross-modal correlation, which supports the feasibility of scalable mental fatigue detection.

5. Discussion

The experimental results presented in this study offer valuable insights into the feasibility and robustness of contactless fatigue detection systems based on eye movement features and physiological indicators extracted from facial videos. This section discusses the key findings from the perspective of model performance, cross-modal agreement, generalizability, and limitations.

5.1. Model Effectiveness and Cross-Validation Stability

The TabNet model trained on statistical features derived from eye movement data achieved consistent performance across cross-validation folds, with mean accuracy reaching 81% and maintaining a similar level (82%) in random splits. This level of stability is a crucial attribute in fatigue detection scenarios, where variations across sessions and individuals can challenge model generalization. The inherent interpretability and feature-wise attention mechanism of TabNet likely contributed to its robustness, allowing it to focus on the most informative ocular features, despite individual variability.
Similarly, the physiological indicator model trained on a limited subset of the data achieved an accuracy of 78% using Random Forests. This outcome supports the utility of visually derived vital signs and human activity indicators for modeling fatigue and suggests that even short training portions can capture discriminative features to represent fatigue states.

5.2. Cross-Modal Convergence and Correlation Insights

A key contribution of this study lies in the correlation-based analysis that compared predictions across different models and modalities. The session-wise Pearson correlation analysis demonstrated that the fatigue predictions estimated by eye movements are significantly aligned with those derived from physiological indicators in the majority of the sessions. Specifically, the eye movement model and the physiological indicator model showed statistically significant positive correlations in 21 out of 27 sessions, with moderate-to-strong correlation in nearly half of them. This finding supports the hypothesis that both modalities capture a shared underlying cognitive fatigue construct, despite relying on distinct signal types and feature spaces.
Moreover, the earlier vision-based model trained on an independent dataset also demonstrated significant correlation with the eye movement model in 19 out of 27 sessions with generally weaker correlation coefficients. This discrepancy is expected given the differences in task context and participant health conditions. Nevertheless, the fact that meaningful temporal alignment in fatigue prediction was still observed supports the possibility of partial generalization across datasets.

5.3. Limitations and Future Directions

While the results are encouraging, there are still a few important limitations to consider. One of the main challenges is the relatively small sample size, which makes it harder to draw strong conclusions or generalize the findings to a wider population. In some cases, the models aligned well, while in others, the agreement was noticeably weaker. To address this, future studies should consider techniques like personalized modeling, which could help tailor predictions to individual differences and improve overall reliability. Nevertheless, we have taken several measures to mitigate potential issues with the small data size. For example, we adopted a cross-validation technique so that the model is always evaluated on unseen participants or sessions. This reduces the risk of bias toward individual-specific patterns. Additionally, many of the features used in our framework (e.g., eye movement indicators and vital signs) have well-established correlations with fatigue in the prior literature, which increases the confidence that the model is capturing generalizable physiological mechanisms rather than dataset artifacts. Moreover, we view this work as a proof-of-concept study demonstrating the feasibility of estimating fatigue from multimodal signals. While large-scale validation across diverse populations is an important next step, the present findings provide a valuable foundation for future research and practical deployment.
Second, the reduced correlation with the earlier vision-based model trained on a separate dataset highlights the sensitivity of fatigue estimation systems to contextual factors such as the task type. This highlights the importance of having more consistent experimental setups and taking a closer look at how well the features used by these models hold up when applied in different environments or tasks.
To better understand our results, it is important to compare our vision-based approach with EEG-based fatigue detection, traditionally considered the gold standard for assessing cognitive states. EEG provides direct, millisecond-scale information concerning brain activity and achieves very high detection rates for mental fatigue. However, this accuracy comes with notable weaknesses. EEG is highly intrusive, requires expert operators and special equipment, and is movement-sensitive, which limits its applicability to extended or everyday monitoring in tasks such as driving, factory work, or office environments. In contrast, our vision-based solution is slightly less accurate but has the major advantages of being completely contactless, scalable, and deployable using standard cameras or webcams. This trade-off between peak accuracy and practical deployability is central to our methodological choice.
Additionally, although the physiological indicator model leverages vision-based features, which are more practical than data collected by special and expensive devices, it still operates under assumptions of consistent facial visibility and stable lighting. Real-world deployment would therefore require mitigating noise and occlusion, for which several techniques can be explored in future work: robust preprocessing pipelines including illumination normalization and face tracking with landmark recovery, complementary sensing modalities to reduce reliance on facial visibility, and model-level strategies such as training with augmented datasets that simulate occlusions and lighting variation, as sketched below.
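As an illustration of the last strategy, the following sketch simulates occlusion and illumination changes on a video frame. The patch sizes, occlusion probability, and brightness gain range are illustrative choices for demonstration, not tuned values from this study.

```python
import numpy as np

def augment_frame(frame, rng, occlude_p=0.5, max_patch=0.3, gain_range=(0.6, 1.4)):
    """Simulate occlusion and illumination changes on an HxWxC uint8 frame."""
    out = frame.astype(np.float32)
    # Global illumination change: multiply by a random gain, then clip.
    out *= rng.uniform(*gain_range)
    # Random rectangular occlusion covering up to `max_patch` of each side.
    if rng.random() < occlude_p:
        h, w = out.shape[:2]
        ph = int(h * rng.uniform(0.1, max_patch))
        pw = int(w * rng.uniform(0.1, max_patch))
        y0, x0 = rng.integers(0, h - ph), rng.integers(0, w - pw)
        out[y0:y0 + ph, x0:x0 + pw] = 0.0  # black patch mimics an occluder
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
frame = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)  # stand-in frame
augmented = augment_frame(frame, rng)
```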
Finally, to follow up on the current work, a key direction for the future is to introduce user experience perspectives into the analysis. As our models are focused on objective physiological and behavioral indicators of mental fatigue, integrating experiential and emotional measures, such as perceived workload, comfort, stress, or engagement, could provide a deeper understanding of how fatigue occurs and is experienced in real life. This would not only help validate the models against subjective perceptions but also guide the design of more user-centered monitoring systems.

6. Conclusions

This study investigated two models for mental fatigue detection: an eye-movement-based model utilizing statistical features extracted from eye movement data and a vision-based model that relies on vital signs and human activity indicators extracted from facial video. Both models achieved encouraging performance when trained and tested on temporally and contextually distinct partitions of the dataset. The eye-tracking model, implemented using TabNet, demonstrated steady and robust performance across participants, while the physiological indicator model, based on Random Forest, maintained high accuracy even when trained on a small subset of the data.
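For completeness, a minimal sketch of the eye-tracking classifier setup is shown below, using the open-source pytorch-tabnet implementation of TabNet [46]; the synthetic features, labels, split, and hyperparameters are illustrative placeholders rather than the exact configuration used in this study.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier  # pip install pytorch-tabnet

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype(np.float32)  # placeholder eye-movement features
y = rng.integers(0, 2, size=500)                   # placeholder fatigue labels

# Simple split for illustration; in practice, folds keep participants unseen.
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]

clf = TabNetClassifier(n_d=8, n_a=8, n_steps=3)    # illustrative hyperparameters
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], max_epochs=50, patience=10)
print(clf.predict(X_valid)[:10])
```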
A major contribution of this work lies in the correlation-based comparative analysis, which demonstrated meaningful alignment between the predictions of the two models. In a significant number of sessions, the models exhibited synchronized trends in their fatigue-level predictions. The correlation between the eye movement model and an earlier-developed vision-based model, trained on an entirely separate dataset, further supports the generalizability of fatigue-related signals extracted from facial videos.
The findings highlight the feasibility of employing camera-only inputs to implement fatigue detection systems. Although the eye-tracking model suffers from the drawback of requiring special sensors and devices, the new vision-based model provides a scalable and non-intrusive alternative that can be implemented using standard video data. It is, therefore, applicable in scenarios where contact-based sensing is impractical or undesirable, such as in transportation and industrial safety.
From a scalability perspective, the proposed method can be considered complementary to sensor-based approaches. In future work, we envision multimodal integration, where video-based indicators are combined with signals from EEG, PPG, or IMUs to provide more robust fatigue estimation across diverse application contexts. Such integration would allow adaptive deployment depending on environmental constraints and available hardware. Therefore, while our current work focuses on video-based feasibility, we fully acknowledge the trade-offs and see significant potential for hybrid frameworks that leverage both visual and wearable sensing modalities for scalable fatigue detection.
In conclusion, this paper demonstrates that both eye-movement-based and vision-based models can reliably predict mental fatigue and that their outputs largely converge on unseen data. These results lay the groundwork for the future development of interpretable, adaptable, and scalable fatigue monitoring systems suitable for deployment in real-world environments.

Author Contributions

Conceptualization, N.S. and A.K.; methodology, N.S. and A.K.; software, B.H. and W.O.; validation, N.S. and A.K.; formal analysis, N.S. and A.K.; investigation, B.H. and W.O.; data curation, B.H. and W.O.; writing—original draft preparation, B.H. and W.O.; writing—review and editing, N.S. and A.K.; visualization, B.H. and W.O.; supervision, N.S. and A.K.; project administration, N.S. and A.K.; funding acquisition, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the Russian State Research grant FFZF-2025-0003.

Institutional Review Board Statement

Approval for the study was granted and monitored by the local ethics committee of the St. Petersburg Federal Research Center of the Russian Academy of Sciences. The research was conducted in accordance with the 2013 revision of the World Medical Association Declaration of Helsinki. The Ethics Committee of the St. Petersburg Federal Research Center of the Russian Academy of Sciences reviewed and approved the study (protocol No. 7, 26 June 2025).

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, G.; Yau, K.K.; Zhang, X.; Li, Y. Traffic accidents involving fatigue driving and their extent of casualties. Accid. Anal. Prev. 2016, 87, 34–42.
  2. Hu, X.; Lodewijks, G. Detecting fatigue in car drivers and aircraft pilots by using non-invasive measures: The value of differentiation of sleepiness and mental fatigue. J. Saf. Res. 2020, 72, 173–187.
  3. Dawson, D.; Searle, A.K.; Paterson, J.L. Look before you (s)leep: Evaluating the use of fatigue detection technologies within a fatigue risk management system for the road transport industry. Sleep Med. Rev. 2014, 18, 141–152.
  4. Gomer, J.; Walker, A.; Gilles, F.; Duchowski, A. Eye-Tracking in a Dual-Task Design: Investigating Eye-Movements, Mental Workload, and Performance. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 2008, 52, 1589–1593.
  5. Revanur, A.; Dasari, A.; Tucker, C.S.; Jeni, L.A. Instantaneous Physiological Estimation using Video Transformers. arXiv 2022, arXiv:2202.12368.
  6. Jain, M.; Deb, S.; Subramanyam, A.V. Face video based touchless blood pressure and heart rate estimation. In Proceedings of the 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP), Montreal, QC, Canada, 21–23 September 2016; pp. 1–5.
  7. Othman, W.; Hamoud, B.; Shilov, N.; Kashevnik, A. Human Operator Mental Fatigue Assessment Based on Video: ML-Driven Approach and Its Application to HFAVD Dataset. Appl. Sci. 2024, 14, 510.
  8. He, X.; Li, S.; Zhang, H.; Chen, C.; Li, J.; Dragomir, A.; Bezerianos, A.; Wang, H. Towards a nuanced classification of mental fatigue: A comprehensive review of detection techniques and prospective research. Biomed. Signal Process. Control 2026, 111, 108496.
  9. Wang, L.; Li, H.; Yao, Y.; Han, D.; Yu, C.; Lyu, W.; Wu, H. Smart cushion-based non-invasive mental fatigue assessment of construction equipment operators: A feasible study. Adv. Eng. Inform. 2023, 58, 102134.
  10. Hart, S.G.; Staveland, L.E. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Human Mental Workload; Advances in Psychology; Hancock, P.A., Meshkati, N., Eds.; Elsevier: Amsterdam, The Netherlands, 1988; Volume 52, pp. 139–183.
  11. Luo, H.; Lee, P.A.; Clay, I.; Jaggi, M.; De Luca, V. Assessment of Fatigue Using Wearable Sensors: A Pilot Study. Digit. Biomark. 2020, 4, 59–72.
  12. Meng, J.; Zhao, B.; Ma, Y.; Ji, Y.; Nie, B. Effects of fatigue on the physiological parameters of labor employees. Nat. Hazards 2014, 74, 1127–1140.
  13. Mehmood, I.; Li, H.; Qarout, Y.; Umer, W.; Anwer, S.; Wu, H.; Hussain, M.; Fordjour Antwi-Afari, M. Deep learning-based construction equipment operators’ mental fatigue classification using wearable EEG sensor data. Adv. Eng. Inform. 2023, 56, 101978.
  14. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  15. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 27.
  16. Khishdari, A.; Mirzahossein, H. Unveiling driver drowsiness: A probabilistic machine learning approach using EEG and heart rate data. Innov. Infrastruct. Solut. 2025, 10, 275.
  17. Anitha, C.; Venkatesha, M.; Adiga, B.S. A Two Fold Expert System for Yawning Detection. Procedia Comput. Sci. 2016, 92, 63–71.
  18. Hussain, A.; Saharil, F.; Mokri, R.; Majlis, B. On the use of MEMs accelerometer to detect fatigue department. In Proceedings of the 2004 IEEE International Conference on Semiconductor Electronics, Kuala Lumpur, Malaysia, 7–9 December 2004; p. 5.
  19. Melo, H.; Nascimento, L.; Takase, E. Mental Fatigue and Heart Rate Variability (HRV): The Time-on-Task Effect. Psychol. Neurosci. 2017, 10, 428–436.
  20. Csathó, Á.; Van der Linden, D.; Matuz, A. Change in heart rate variability with increasing time-on-task as a marker for mental fatigue: A systematic review. Biol. Psychol. 2024, 185, 108727.
  21. Matuz, A.; van der Linden, D.; Kisander, Z.; Hernádi, I.; Kázmér, K.; Csathó, Á. Low Physiological Arousal in Mental Fatigue: Analysis of Heart Rate Variability during Time-on-task, Recovery, and Reactivity. bioRxiv 2020.
  22. Kashevnik, A.; Shchedrin, R.; Kaiser, C.; Stocker, A. Driver Distraction Detection Methods: A Literature Review and Framework. IEEE Access 2021, 9, 60063–60076.
  23. Zhao, Q.; Nie, B.; Bian, T.; Ma, X.; Sha, L.; Wang, K.; Meng, J. Experimental study on eye movement characteristics of fatigue of selected college students. Res. Sq. 2023.
  24. Sampei, K.; Ogawa, M.; Torres, C.C.C.; Sato, M.; Miki, N. Mental Fatigue Monitoring Using a Wearable Transparent Eye Detection System. Micromachines 2016, 7, 20.
  25. Cui, Z.; Sun, H.M.; Yin, R.N.; Gao, L.; Sun, H.B.; Jia, R.S. Real-time detection method of driver fatigue state based on deep learning of face video. Multimed. Tools Appl. 2021, 80, 25495–25515.
  26. Dua, M.; Shakshi; Singla, R.; Raj, S.; Jangra, A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput. Appl. 2021, 33, 3155–3168.
  27. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’12, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 1, pp. 1097–1105.
  28. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
  29. Ye, M.; Zhang, W.; Cao, P.; Liu, K. Driver Fatigue Detection Based on Residual Channel Attention Network and Head Pose Estimation. Appl. Sci. 2021, 11, 9195.
  30. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. arXiv 2018, arXiv:1807.02758.
  31. Zhan, T.; Xu, C.; Zhang, C.; Zhu, K. Generalized Maximum Likelihood Estimation for Perspective-n-Point Problem. arXiv 2024, arXiv:2408.01945.
  32. Dey, S.; Chowdhury, S.A.; Sultana, S.; Hossain, M.A.; Dey, M.; Das, S.K. Real Time Driver Fatigue Detection Based on Facial Behaviour along with Machine Learning Approaches. In Proceedings of the 2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON), Dhaka, Bangladesh, 28–30 November 2019; pp. 135–140.
  33. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
  34. Mehmood, I.; Li, H.; Umer, W.; Arsalan, A.; Anwer, S.; Mirza, M.A.; Ma, J.; Antwi-Afari, M.F. Multimodal integration for data-driven classification of mental fatigue during construction equipment operations: Incorporating electroencephalography, electrodermal activity, and video signals. Dev. Built Environ. 2023, 15, 100198.
  35. Hamoud, B.; Othman, W.; Shilov, N. Analysis of Computer Vision-Based Physiological Indicators for Operator Fatigue Detection. In Proceedings of the 2025 37th Conference of Open Innovations Association (FRUCT), Narvik, Norway, 14–16 May 2025; pp. 47–58.
  36. Cortes, C.; Vapnik, V. Support Vector Networks. Mach. Learn. 1995, 20, 273–297.
  37. Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression; Wiley: Hoboken, NJ, USA, 1989.
  38. Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice Hall PTR: Hoboken, NJ, USA, 1994.
  39. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
  40. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD ’16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Krishnapuram, B., Shah, M., Smola, A.J., Aggarwal, C., Shen, D., Rastogi, R., Eds.; ACM: New York, NY, USA, 2016; pp. 785–794.
  41. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  42. Liu, P.; Lyu, M.; King, I.; Xu, J. SelFlow: Self-Supervised Learning of Optical Flow. arXiv 2019, arXiv:1904.09117.
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  44. Guo, J.; Zhu, X.; Yang, Y.; Yang, F.; Lei, Z.; Li, S.Z. Towards Fast, Accurate and Stable 3D Dense Face Alignment. arXiv 2020, arXiv:2009.09960.
  45. Khokhlov, I.; Davydenko, E.; Osokin, I.; Ryakin, I.; Babaev, A.; Litvinenko, V.; Gorbachev, R. Tiny-YOLO object detection supplemented with geometrical data. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), Antwerp, Belgium, 25–28 May 2020; pp. 1–5.
  46. Arik, S.Ö.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. arXiv 2019, arXiv:1908.07442.
  47. Kovalenko, S.; Mamonov, A.; Kuznetsov, V.; Bulygin, A.; Shoshina, I.; Brak, I.; Kashevnik, A. OperatorEYEVP: Operator Dataset for Fatigue Detection Based on Eye Movements, Heart Rate Data, and Video Information. Sensors 2023, 23, 6197.
  48. Kashevnik, A.; Kovalenko, S.; Mamonov, A.; Hamoud, B.; Bulygin, A.; Kuznetsov, V.; Shoshina, I.; Brak, I.; Kiselev, G. Intelligent Human Operator Mental Fatigue Assessment Method Based on Gaze Movement Monitoring. Sensors 2024, 24, 6805.
  49. Sedgwick, P. Pearson’s correlation coefficient. BMJ 2012, 345, e4483.
Figure 1. Overall research structure.
Figure 2. Fatigue detection model proposed in [35].
Figure 3. Eye-tracking-based method.
Figure 4. New vision-based method.
Figure 5. Correlation analysis between the predictions of the eye-movement-based model and vision-based model.
Figure 6. Correlation analysis between the predictions of the eye-movement-based model and the earlier vision-based model.
Figure 7. Examples of predictions made by the proposed models; subfigures (a–d) present the calculated graphs for different participants.