Gender-Aware Driver Drowsiness Detection Using Multi-Stream Shifted-Window-Based Hierarchical Vision Transformers

Nurnoby, M. Faisal; El-Alfy, El-Sayed M.

doi:10.3390/app16073353


by M. Faisal Nurnoby ¹,* and El-Sayed M. El-Alfy ¹,²,³,*

¹ Information and Computer Science Department, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 34464, Saudi Arabia
² Computer Engineering Department, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 34464, Saudi Arabia
³ Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum and Minerals (KFUPM), Dhahran 34464, Saudi Arabia
* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3353; https://doi.org/10.3390/app16073353

Submission received: 18 February 2026 / Revised: 19 March 2026 / Accepted: 24 March 2026 / Published: 30 March 2026


Abstract

Because driver fatigue contributes substantially to traffic accidents, detecting and mitigating it has become one of the main goals of intelligent driver-assistance systems, with the aim of enhancing driving safety and comfort. Among various approaches, vision-based facial analysis using deep learning has emerged as an effective and non-intrusive method for identifying driver drowsiness, a key manifestation of fatigue. However, current drowsiness detection models do not account for demographic factors like gender, even though recent research has shown gender-related behavioral differences in eye closure duration, blink frequency, yawning patterns, and facial muscle relaxation. In this paper, we present a fine-grained multi-stream transformer architecture that incorporates gender-awareness and shifted-window attention for spatial feature fusion. Integrating a gender embedding that modulates the region-based features allows the model to learn gender-conditioned drowsiness features and thereby reduce bias and diluted representations. Using the NTHU-DDD dataset, we evaluated two-stream and three-stream variants in gender-aware and gender-agnostic settings across three facial region contexts: the face region with a 20% margin, the bare face region, and key facial regions (face, eyes, and mouth). A comprehensive ablation study was conducted to identify the most effective model setup. The results demonstrate that incorporating the gender embedding improves detection performance, achieving an accuracy of 95.47% on the evaluation set. Moreover, the proposed three-stream model (SWT-DD-3S) produced better results.

Keywords:

intelligent transportation systems (ITS); driver drowsiness detection (DDD); gender-aware; Swin Transformer

1. Introduction

Driver drowsiness detection has gained considerable research attention in recent years due to its broad applicability in several areas, including transportation and workplace safety. According to recent reports, drowsy driving remains one of the leading causes of traffic accidents, injuries, and property damage worldwide. For instance, the Governors Highway Safety Association (GHSA) recently reported that “more than 6300 people died in suspected drowsy driving crashes in 2023” (https://www.ghsa.org/resource-hub/wake-up-call-update (accessed on 22 March 2026)), which is about ten times the 633 deaths reported for 2023 in U.S. federal statistics. Comprehensive national injury estimates are not produced each year; the latest from NHTSA (2017) indicates approximately 91,000 crashes and 50,000 injuries [1]. Reliable drowsiness detection systems can alert drivers early when they start to feel sleepy, track the driver’s alertness, and help control the vehicle, thus reducing the chance of accidents. Therefore, research on driver drowsiness detection plays an important role in advancing road safety and reducing drowsiness-related incidents as part of the development of next-generation intelligent transportation systems.

Several detection techniques [2,3,4,5,6,7] have been recently proposed in this field. Among these methods, vision-based facial analysis has become a useful, non-intrusive solution. The visual symptoms of drowsiness usually appear as behavioral cues such as slow blinking, extended eye closure, frequent yawning, and head nodding. Current vision-based drowsiness recognition techniques typically comprise three phases: face detection, extraction of facial features, and determination of the fatigue state [8]. Techniques for detecting the driver’s face include multi-task convolutional neural networks (MTCNNs) [9], RetinaFace [10], spatial pyramid pooling, and multi-scale feature extraction. Facial keypoints are usually extracted to identify signs of drowsiness, namely, blinking and yawning. Common methods for this task include the practical facial landmark detector (PFLD) [11], the Dlib library, and multi-scale facial landmark detection [12]. In this regard, vision-based deep learning methods, particularly popular image-based convolutional neural network (CNN) families such as VGG, ResNet, DenseNet, Inception, and EfficientNet, have achieved strong accuracy on benchmark datasets [2,6,13].

Detecting driver drowsiness from real-time or continuous video streams poses an even greater challenge due to complex driving conditions, unpredictable driver movements, occlusions, motion blur, and variations in lighting, head pose, and facial expressions. To handle these issues, advanced learning architectures are essential to detect driver drowsiness more accurately and reliably across diverse conditions. Employing CNN-based methods turns out to be less effective as convolutions operate with fixed local receptive fields; even deep stacks can miss faint, spatially separated drowsiness cues such as the joint pattern of eyelid droop and mouth opening. As drowsiness develops progressively over time, effective modeling requires the capture of long-range dependencies and smoothing out momentary noise (e.g., short-duration blinks). Transformer-based models overcome this by using self-attention to capture global relationships without recurrence. They also help in parallel processing and more efficient learning of long-distance interactions [14]. Within this family, Swin Transformer [15] is well suited to analyze facial clues and detect drowsiness. This architecture builds a hierarchical representation via non-overlapping local windows. It periodically shifts those windows to allow information flow across neighboring windows. The design is computationally efficient, and it connects subtle, spatially distributed signals across the face. Unlike global self-attention models (e.g., Vision Transformer (ViT) [16]), Swin’s windowed-shifted attention scales to higher resolutions. It balances better local detail with global context; therefore, it can detect nuanced drowsiness indicators. Also, approaches that use full frames require huge computational resources and are easily affected by redundant or noisy information, while drowsiness cues are mostly concentrated in the face, eyes, and mouth regions.

Motivated by these insights, we propose multi-stream Swin Transformer-based models for drowsiness detection (SWT-DD-2S and SWT-DD-3S). The architecture captures fine-grained local patterns and integrates information from eye, mouth, and face regions within sampled video frames. Moreover, unlike earlier driver drowsiness models that overlook demographic factors such as gender and treat all drivers the same way, the proposed architecture incorporates gender embeddings, as male and female drivers may exhibit drowsiness differently [17]. Studies in related domains have also revealed that integrating demographic information can improve prediction accuracy and reduce bias [18]. In summary, the main contributions of the paper are as follows:

We propose a gender-aware multi-stream transformer with shifted windows that fuses eye and mouth encoders to learn discriminative features for drowsiness detection.
We evaluate a diverse set of deep learning techniques for gender-aware and gender-agnostic driver drowsiness detection.
We optimize a proposed two-stream approach based on comprehensive ablation experiments to determine the best configuration settings for pretrained and non-pretrained variants for robust drowsiness detection.
We validate and compare the proposed model through extensive experiments, providing empirical evidence of gender influence and bias on vision-based drowsiness detection.

The rest of this paper is structured as follows. Section 2 briefly reviews the related research. The proposed models and methodology are presented in detail in Section 3. The experimental results and analysis are given in Section 4. Section 6 summarizes the paper.

2. Related Work

Driver drowsiness detection has made significant advancements in recent years. This section reviews the most relevant vision-based approaches employing convolutional or transformer architectures. A summary of notable models is shown in Table 1.

2.1. Convolutional-Based Feature Representations

This category of models is widely used for detecting driver drowsiness using visual facial feature representations. These models can be subdivided into two main subcategories: Euclidean (grid-based) convolutional architectures, such as CNNs, or non-Euclidean (graph-based) convolutional architectures, such as graph convolutional networks (GCNs). One of the early contributions on deep learning using CNNs for driver drowsiness detection is DriCare, introduced in [19]. This system detects drowsiness symptoms such as yawning, blinking, and eye closure by applying multiple CNNs with Kernelized correlation filters. The detection system uses 68 keypoints in the facial region for real-time application. The model showed high tracking stability, but its performance decreases in low light or when the driver wears glasses. A similar work in [25] presented a comparison of multiple machine learning and deep learning models across three public datasets. Among the considered techniques are K-nearest neighbors (KNN), support vector machines (SVMs) and CNNs. However, most of these models rely on traditional machine learning or earlier CNN designs and do not include the improvements found in recent architectures.

To capture both spatial and temporal features, Guo and Markoni proposed a hybrid model of CNN and long short-term memory (LSTM) [20]. This model was evaluated on the ACCV 2016 drowsy driver dataset, which has short videos recorded under controlled conditions and limited subject diversity, and the best accuracy achieved after refinement was 84.85%. In [2], another CNN-based system was presented for real-time eye closure detection. Three classification models were evaluated using a Fully Designed Neural Network (FD-NN) and two transfer learning (TL) models based on VGG16 and VGG19 (TL-VGG). However, the paper focused on eye closure, ignoring other cues of drowsiness such as head nodding and yawning. Additionally, the models depend on pretrained weights with limited datasets. Another work in [30] applied MobileNet with transfer learning for real-time eyelid state classification on an MRL dataset. In [31], a driver drowsiness detection model was presented that combines a generative adversarial network (GAN) with YOLO. The GAN creates synthetic images of drowsy faces, which are mixed with real data to train the YOLO model. This combination improves real-time detection and helps autonomous vehicles navigate safely. However, the GAN-generated images may not fully represent real-world conditions, which can lead to bias or overfitting. Another real-time ensemble-based deep learning system was proposed in [3]. Four CNN models were combined: AlexNet, VGG-FaceNet, FlowImageNet, and ResNet, each specialized in capturing different features, including environmental context, facial expressions, behavioral cues, and hand gestures. Thampi et al. [29] compared CNN-based detection with a computer-vision pipeline using Dlib/MediaPipe and OpenCV for eye/mouth/head monitoring; their results indicate that the CV-based approach can outperform CNN baselines in their setting, highlighting that robustness sometimes comes from stable geometric cues rather than deeper networks.

Graph-based deep models have been found to be effective in driver fatigue detection. The article [32] proposed a graph-based convolutional model. The model is multi-aware through class-attention-aware and composite extractors. In [33], the authors combined GCN with LSTM to detect driver drowsiness. However, it used only a limited number of samples available in the dataset with no clear discussion of potential overfitting. Another attention-based GCN was proposed in [22] for driver fatigue by encoding facial skeleton. The work requires further investigation in varied real-world driving conditions.

2.2. Transformer-Based Methods

Since its introduction in 2020, Vision Transformer (ViT) has become increasingly preferable over traditional CNN-based models for image classification tasks [15,16]. It was inspired by the success of transformer-based architectures for natural language processing (NLP), which all started with the creative idea ‘Attention is all you need’ in 2017 [34]. Adapting transformers from language to vision presents challenges as visual data shows greater variability in the size of objects and much higher resolution. Researchers have been working to address these issues. However, a transformer-based architecture has several advantages as well. ViTs can model global context using self-attention from the very first layer, whereas CNNs rely on stacked local operations and slow growth of receptive field [15]. This global reasoning helps ViTs outperform CNN-based models where tasks require holistic understanding in complex background or occlusion. ViTs offer a simple modular architecture without relying on strong inductive biases like translation equivariance or locality, which is common for CNNs.

Multiple studies have employed ViT-based models for driver drowsiness detection. The study in [26] proposed a real-time vision-transformer framework for binary eye state classification. The system integrates Haar cascade classifiers, data augmentation, and a drowsiness scoring mechanism. Experiments demonstrated that transformer models outperform traditional CNN-based approaches in both accuracy and generalization. However, it relies heavily on static eye-state classification, ignoring other cues such as mouth state. Another article [35] applied a pretrained ViT-based model, trained on ImageNet1K-v1, to analyze facial images extracted from a private driver behavior dataset. However, the model has less deployment feasibility on embedded systems. This limitation was addressed in [36] with Vis-Net, a hybrid architecture combining ViT and CNNs. To optimize recognition, Vis-Net integrates emotion detection. The approach also considers real-world scenarios such as low-light and masked conditions.

As outlined in [16], a transformer model requires a pretraining stage on a large amount of curated data to achieve good performance over a CNN-based model. To address this issue, another teacher–student-specific transformer architecture, dubbed DeiT [37], was proposed. The model significantly improves data efficiency by relying on a distillation token where a CNN acts as a teacher that teaches a transformer via attention. In short, a teacher transfers inductive biases to a transformer without explicitly encoding them. There are many other ways one can inject more types of biases without losing the transformer architecture. However, both ViT and DeiT treat patch embedding as independent units, thus ignoring local continuity. Also, simple tokenization may fail to encode important local structures. Hence, another variation in this group, dubbed T2T-ViT, was proposed in [38]. This model overcomes the aforementioned issues by redefining the tokenization process. This model achieves better performance while reducing parameters and MACs of vanilla ViT by half. In T2T-ViT, a transformer layer’s output tokens are reconstructed as an image. The image is then split into overlapping tokens. The adjacent tokens are then joined together by flattening the patches. This way, information from surrounding patches is captured and embedded into tokens before they are fed to the following transformer layer. Though T2T-ViT showed better generalization on smaller datasets, it still lacked an inherent hierarchical feature learning mechanism, which is common in convolutional architectures.

Swin-Transformer [15] was introduced to address the limitations of global self-attention by computing hierarchical feature representations with shifted windows. This design enhanced computational efficiency as self-attention was confined to non-overlapping local windows. Moreover, by slightly shifting these windows between layers, the model effectively captured cross-window dependencies, enabling it to learn contextual relationships among different image regions. Rahmani et al. [28] proposed a semi-supervised framework combining YOLOv8 for robust face detection with a Swin Transformer. Building on the strengths of the Swin Transformer, the study in [23] proposed a framework named FPIRST, a residual Swin-T. The model generated parameter matrices of drowsiness from eye and mouth aspect ratios, then encoded them into images. The experiments showed the approach performed better than traditional CNN-based systems. However, the approach is computationally costly, and the dataset used lacks robustness and diversity. The literature review suggests that Swin-T effectively overcomes the challenges of other ViT-based architectures and the limitations of CNN-based models. Consequently, our study in this paper adopts Swin-T as the base unit in our framework. The proposed architecture employs a three-stream design, focusing separately on three key facial regions. Additionally, it incorporates gender data, a feature that has not been explored in previous vision-based studies.

3. Proposed Methodology

This study presents a gender-aware framework, as illustrated in Figure 1, for video-based driver drowsiness detection using advanced deep learning techniques.

The proposed approach aims to overcome the shortcomings of current techniques by adding improved data preprocessing steps, transformer-based models, and transfer-learning strategies that account for gender-specific behavioral and visual differences. The process begins with video preprocessing, where driving footage is segmented into clips and key frames are extracted for each clip. Subsequently, each frame is analyzed to delineate facial, eye, and mouth regions. This is followed by image resizing, normalization, and data augmentation to improve the model’s robustness and generalization capability. The framework initially explores a range of deep learning architectures, including ViT, DeiT, and ResNet50V2. These architectures are trained and evaluated under both gender-aware and gender-agnostic conditions, using standard performance metrics such as accuracy, precision, recall, and F1-score.

Finally, this study explores a gender-aware three-stream Swin Transformer model for drowsiness detection (SWT-DD-3S), with a focus on three distinct facial regions (eyes, mouth, and face crops). The proposed architecture is illustrated in Figure 2. It employs three separate transformer encoders, dedicated to eye, mouth, and face regions. Gender is incorporated during the final feature fusion stage to enable the model to learn gender-specific variations in drowsiness cues. Furthermore, several ablation experiments are conducted to determine the optimal model configuration. Detailed descriptions of the model components and each stage of the overall system architecture are provided in the following subsections.

3.1. Extraction of Drowsy Facial Regions

Video frames are preprocessed to extract important regions focusing specifically on the face, eyes, and mouth regions, which are critical in identifying signs of drowsiness. MTCNN/MTCNN++ [39] or RetinaFace [10] can be used for this purpose, but in our study, we employed RetinaFace, which is a multi-level face localization tool with higher detection accuracy and greater robustness. It integrates face detection, 2D facial landmark localization, and 3D vertex regression into one efficient unified framework. Unlike MTCNN, which employs separate stages for detection and landmark prediction, RetinaFace performs these tasks simultaneously in a single pass using a feature pyramid network. This integration reduces false positives and stabilizes predictions and, thus, improves the overall face detection quality. The tool can perform robustly, especially under challenging conditions such as varying illumination and head poses. This makes it more suitable for accurately cropping both frontal and non-frontal driver faces from video frames. The resulting cropped images are organized into structured directories based on dataset splits, gender (male, female), and alertness states (drowsy, non-drowsy).

We produced three types of crops using RetinaFace: (1) face with a 20% enlarged margin, (2) face region without any margin, and (3) separate crops of both eyes and mouth (Figure 3). For the third type, a paired dataset class was developed to simultaneously load corresponding eye and mouth images for each subject. For all three crop types, the driver images are standardized to a fixed resolution, normalized using ImageNet-derived mean and standard deviation values, and batched for efficient GPU training. We addressed the data imbalance issue in two ways: by assigning different frame sampling rates for different classes during frame extraction, and by applying data augmentation techniques.
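The margin-based crop can be illustrated with a small sketch. The paper does not specify how the 20% margin is distributed, so the version below assumes that every side of the detected box is pushed outward by 20% of the corresponding box dimension and then clipped to the frame. The function name `expand_box` and the `(x1, y1, x2, y2)` box layout are hypothetical and are not part of the RetinaFace API.

```python
def expand_box(box, img_w, img_h, margin=0.20):
    """Grow a face box outward and clip it to the image bounds.

    Assumption: each side moves outward by `margin` times the box
    width (left/right) or height (top/bottom).
    """
    x1, y1, x2, y2 = box
    pad_x = margin * (x2 - x1)
    pad_y = margin * (y2 - y1)
    new_x1 = max(0, int(round(x1 - pad_x)))
    new_y1 = max(0, int(round(y1 - pad_y)))
    new_x2 = min(img_w, int(round(x2 + pad_x)))
    new_y2 = min(img_h, int(round(y2 + pad_y)))
    return new_x1, new_y1, new_x2, new_y2


# A 100 x 100 face box inside a 640 x 480 frame.
print(expand_box((100, 100, 200, 200), img_w=640, img_h=480))
# A box near the top-left corner is clipped at the frame border.
print(expand_box((10, 10, 110, 110), img_w=640, img_h=480))
```

Clipping matters for drivers who sit close to the frame edge, where an unclipped margin would request pixels outside the image.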

3.2. Extraction of Spatial Features

The proposed model encodes features from diverse facial regions of the driver. Face crops (both with and without margins) are processed by a single-stream network. In contrast, for the separately cropped eyes and mouth regions, we use a two-stream model. In this architecture, one stream processes the eyes and the other processes the mouth. Finally, we integrate eye, mouth, and face encoders to form a three-stream model (SWT-DD-3S). Each stream contains four Swin stages, following the standard Swin-T architecture with shifted-window attention.

Unlike traditional transformers that use global attention mechanisms, Swin Transformers compute self-attention within local windows, which significantly reduces computational complexity. Also, the windows are periodically shifted to enhance the interaction between neighboring windows, which helps capture fine-grained local details and broader contextual information. The first configuration partitions the $8 \times 8$ feature maps into evenly spaced $4 \times 4$ windows starting from the top-left corner. The next block shifts the window positions by half the window size, so that the new windows straddle the boundaries of the previous ones. A LayerNorm (LN) is used before both the multi-head self-attention (MSA) and the MLP, followed by a residual connection after each module (Equation (1)). These alternating windowing strategies are used across consecutive transformer blocks for enhanced feature learning.

$$
\begin{aligned}
\hat{y}^{b} &= \text{W-MSA}(\text{LN}(y^{b-1})) + y^{b-1}, \\
y^{b} &= \text{MLP}(\text{LN}(\hat{y}^{b})) + \hat{y}^{b}, \\
\hat{y}^{b+1} &= \text{SW-MSA}(\text{LN}(y^{b})) + y^{b}, \\
y^{b+1} &= \text{MLP}(\text{LN}(\hat{y}^{b+1})) + \hat{y}^{b+1}
\end{aligned}
\tag{1}
$$

where $\hat{y}^{b}$ and $y^{b}$ represent the output features of the window-based MSA (W-MSA) and MLP modules of block $b$, respectively, and $\hat{y}^{b+1}$ and $y^{b+1}$ are the corresponding outputs of the shifted-window MSA (SW-MSA) and MLP modules of block $b + 1$.
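The alternation between regular and shifted windows can be made concrete with a toy example. The sketch below labels each position of an $8 \times 8$ map with a unique id, partitions it into $4 \times 4$ windows, and then repeats the partition after a cyclic shift of half the window size. It only illustrates the window layout; attention itself is not computed, and all helper names are hypothetical.

```python
def make_feature_map(size=8):
    """Label every position with a unique id so it can be tracked."""
    return [[r * size + c for c in range(size)] for r in range(size)]


def window_partition(fmap, win=4):
    """Split a square map into non-overlapping win x win windows (row-major)."""
    size = len(fmap)
    return [
        [row[c0:c0 + win] for row in fmap[r0:r0 + win]]
        for r0 in range(0, size, win)
        for c0 in range(0, size, win)
    ]


def cyclic_shift(fmap, shift):
    """Roll the map toward the top-left by `shift` rows and columns."""
    size = len(fmap)
    return [
        [fmap[(r + shift) % size][(c + shift) % size] for c in range(size)]
        for r in range(size)
    ]


def regular_windows_touched(window, size=8, win=4):
    """Indices of the regular (unshifted) windows a window's tokens came from."""
    per_row = size // win
    return {
        ((pos // size) // win) * per_row + (pos % size) // win
        for row in window
        for pos in row
    }


fmap = make_feature_map(8)
regular = window_partition(fmap, win=4)                   # W-MSA layout
shifted = window_partition(cyclic_shift(fmap, 2), win=4)  # SW-MSA layout, shift = win // 2

print(regular_windows_touched(regular[0]))  # tokens from one regular window
print(regular_windows_touched(shifted[0]))  # tokens drawn from all four regular windows
```

The first shifted window gathers tokens from all four regular windows, which is how information crosses window boundaries between consecutive blocks. In the original Swin Transformer, the cyclic shift is paired with an attention mask so that tokens wrapped around from opposite image edges do not attend to each other; that mask is omitted here.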

3.3. Gender-Aware Fusion of Facial Features for Classification

In our model, each Swin stage involved several Swin blocks that combined shifted-window self-attention, layer normalization, and a multilayer perceptron (MLP) with GELU activations. We used a single-stream architecture for face crops with margin and for face crops without margin. In contrast, for the separately cropped eyes and mouth regions, we employed a two-stream Swin-T backbone with separate encoders for each region. These backbones independently processed the eye, mouth, and face regions to generate 768-dimensional pooled features for each stream, as shown in Equation (2).

$$
F_{e} \in \mathbb{R}^{768}, \quad F_{m} \in \mathbb{R}^{768}, \quad F_{f} \in \mathbb{R}^{768}
\tag{2}
$$

Each backbone employs sequential Swin stages with multiple shifted-window attention blocks to capture both local and global context. Gender information, represented as 0 for male and 1 for female, is injected into our model using Feature-wise Linear Modulation (FiLM) [40]. We also initially tried a gender-wise attention bias, but it did not yield good results. Through FiLM, the region-based embeddings are modulated by an affine transform (Equation (3)):

$$
x' = \gamma \odot x + \beta,
\tag{3}
$$

where $x$ refers to the region-based features ($F_{e}$, $F_{m}$, or $F_{f}$), and $\gamma$ and $\beta$ are parameters learned through the region-based gender-aware FiLM generator model. In our study, the generator model is a 1-hidden-layer MLP that takes the gender embedding as input and generates the scale and shift parameters for each region type (Equation (4)):

$$
[\gamma_{e}, \beta_{e}, \gamma_{m}, \beta_{m}, \gamma_{f}, \beta_{f}] = f_{\text{FiLM}}(E_{g}), \quad \gamma_{\ast}, \beta_{\ast} \in \mathbb{R}^{768}, \quad \ast \in \{e, m, f\}.
\tag{4}
$$

FiLM was applied after feature extraction, outside the encoder, using an identity-safe parametrization. The weights and biases in the final linear layer were zero-initialized. This avoids early instability and ensures that FiLM modulates the features only where the optimizer finds signal in the conditioning variable. The modulated features were then concatenated (Figure 2), and LayerNorm was applied before a linear classifier.
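A minimal sketch of this fusion step is given below in plain Python. Several details are assumptions made for illustration rather than the authors' implementation: the identity-safe form is taken to be $\gamma = 1 + \Delta\gamma$ and $\beta = \Delta\beta$, the zero-initialized layer is taken to be the generator's output layer, the hidden activation is ReLU, the gender embedding is a two-row lookup table, LayerNorm has no learnable affine, and the feature size is reduced from 768 to 8 to keep the demo small. The names `FiLMGenerator`, `film`, and `fuse` are hypothetical.

```python
import math
import random

REGIONS = ("e", "m", "f")  # eyes, mouth, face


def linear(x, weight, bias):
    """Dense layer: `weight` is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class FiLMGenerator:
    """1-hidden-layer MLP mapping a gender embedding to per-region (gamma, beta)."""

    def __init__(self, embed_dim, hidden_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.feat_dim = feat_dim
        # Assumed: a 2-row embedding table (0 = male, 1 = female).
        self.embedding = [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in range(2)]
        self.w1 = [[rng.gauss(0, 0.02) for _ in range(embed_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        # Output layer is zero-initialized, so every delta starts at exactly 0.
        out_dim = 2 * feat_dim * len(REGIONS)
        self.w2 = [[0.0] * hidden_dim for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, gender):
        pre = linear(self.embedding[gender], self.w1, self.b1)
        hidden = [max(0.0, v) for v in pre]  # ReLU (assumed activation)
        out = linear(hidden, self.w2, self.b2)
        params = {}
        for i, region in enumerate(REGIONS):
            start = 2 * i * self.feat_dim
            delta_gamma = out[start:start + self.feat_dim]
            beta = out[start + self.feat_dim:start + 2 * self.feat_dim]
            # Assumed identity-safe form: gamma = 1 + delta_gamma.
            params[region] = ([1.0 + d for d in delta_gamma], beta)
        return params


def film(x, gamma, beta):
    """Equation (3): element-wise scale and shift."""
    return [g * v + b for g, v, b in zip(gamma, x, beta)]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable affine parameters (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fuse(features, gender, generator):
    """Modulate each region's features, concatenate, then apply LayerNorm."""
    params = generator(gender)
    fused = []
    for region in REGIONS:
        fused.extend(film(features[region], *params[region]))
    return layer_norm(fused)


generator = FiLMGenerator(embed_dim=4, hidden_dim=6, feat_dim=8)
features = {region: [float(i + 1) for i in range(8)] for region in REGIONS}
gamma_e, beta_e = generator(1)["e"]
print(film(features["e"], gamma_e, beta_e) == features["e"])  # identity at initialization
print(len(fuse(features, gender=1, generator=generator)))     # 3 regions x feat_dim
```

Under these assumptions the modulation is an exact identity at the start of training, so the classifier initially sees the unmodified backbone features and the gender signal enters only as the output layer's weights move away from zero.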

4. Results and Discussion

We conducted several experiments on the NTHU-DDD dataset. First, we compared the proposed SWT-DD-3S architecture with the existing state-of-the-art image classification models. We did so across three cropped-face regions: (1) eyes, mouth, and face, (2) faces with a 20% margin, and (3) faces without any margin. Lastly, we present an ablation study on the key design components of SWT-DD-3S architecture.

4.1. NTHU-DDD Dataset

This dataset [41] is a widely used benchmark for evaluating driver drowsiness detection methods. It was developed in a controlled lab environment. The dataset is challenging and has only 36 participants, balanced by gender (18 male and 18 female), who performed actions (such as slow blinking, frequent nodding, yawning, falling asleep, talking, laughing, and head movements) representing both drowsy and alert driving conditions. Videos were recorded using a D-Link DCS-932L surveillance camera (D-Link, Taipei, Taiwan) with built-in infrared (IR) LEDs at a resolution of $640 \times 480$ pixels. This camera was used for both daytime and nighttime lighting conditions. In addition, a Logitech C310 HD RGB webcam (720p, 30 fps) (Logitech, Lausanne, Switzerland) was used to capture color videos during daytime. Videos were captured at 15 fps (night) or 30 fps (day). Example frames from the NTHU-DDD dataset are shown in Figure 4.

4.2. Implementation Details

A gender-aware drowsiness image dataset was constructed using frames extracted from the NTHU-DDD video clips. The NTHU-DDD dataset contains an equal number of male and female subjects. Each participant’s gender was recorded during data collection, and all samples from that subject were assigned the corresponding gender label. The videos were then split into individual frames and grouped according to the subject’s gender. Each frame was further labeled as drowsy or alert using the frame-level annotations provided by the dataset. For data preparation, every fifth frame was extracted based on these drowsiness labels. The dataset has RGB and IR images categorized by gender (male or female) and alertness state (drowsy or alert). Each successfully extracted frame was resized to $224 \times 224$ pixels and normalized for input into each model. The encoded labels for gender are male (0) and female (1), and for drowsiness state, alert (0) and drowsy (1). The training dataset with 18 subjects was split into training (80%) and validation (20%). For testing, frames from the four subjects of the evaluation data were used. The model was trained for up to 100 epochs with early stopping based on validation loss improvements (patience of 5 epochs). The cross-entropy loss function guided the training process using the AdamW optimizer with a learning rate of 0.0001 and a batch size of 32.
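The stopping rule described above can be sketched as follows. This is a framework-independent illustration, not the authors' training code: the `EarlyStopping` class, the `min_delta` parameter, the `CONFIG` key names, and the validation losses in the demo are all hypothetical, while the numeric settings come from the text.

```python
# Hyperparameters reported in the text; the key names are illustrative.
CONFIG = {
    "max_epochs": 100,
    "patience": 5,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 32,
    "input_size": (224, 224),
}


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # assumed: any decrease counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, max_epochs=CONFIG["max_epochs"], patience=CONFIG["patience"]):
    """Return (last epoch executed, best validation loss seen)."""
    stopper = EarlyStopping(patience=patience)
    last_epoch = 0
    for last_epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            break
    return last_epoch, stopper.best


# Made-up validation losses: the best value occurs at epoch 3.
demo_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.50]
print(epochs_run(demo_losses))
```

In the demo, training halts five epochs after the best loss, so the later improvement at epoch 9 is never reached; this is the usual trade-off of a short patience window.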

4.3. Evaluation Metrics

To evaluate the drowsiness detection model, we mainly used accuracy, precision, recall, and F1-score. The precision and recall were calculated from the confusion matrix (Table 2). Equation (5) shows the calculation of precision (Prc), recall (Rec), $F_{1}$, and accuracy (Acc).

$$
\begin{aligned}
\text{Prc} &= \frac{TP}{TP + FP}, & \text{Rec} &= \frac{TP}{TP + FN}, \\
F_{1} &= 2 \times \frac{\text{Prc} \times \text{Rec}}{\text{Prc} + \text{Rec}}, & \text{Acc} &= \frac{TP + TN}{TP + FP + TN + FN}
\end{aligned}
\tag{5}
$$

where these metrics are reported in this paper as percentages.
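As a worked illustration of Equation (5), the sketch below computes the four metrics from confusion-matrix counts and returns them as percentages. The function name and the counts in the example are made up for demonstration and are not results from the paper; returning 0 when a denominator is empty is an assumed convention.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and accuracy (in percent) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
        "accuracy": 100 * accuracy,
    }


# Illustrative counts only: 90 drowsy frames caught, 20 missed,
# 10 false alarms, 80 alert frames correctly passed.
for name, value in classification_metrics(tp=90, fp=10, tn=80, fn=20).items():
    print(f"{name}: {value:.2f}%")
```

With drowsy as the positive class, recall reflects how many drowsy frames are caught, while precision reflects how many drowsiness alarms are justified.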

4.4. Evaluation of Single-Stream Models on Cropped Faces with and Without Margins

In this set of experiments, we evaluated four single-stream models: ResNet50-SS, ViT-SS, DeiT-SS, and SWT-DD-SS. Each model was evaluated on two types of cropped face regions: (1) cropped faces without margin expansion, and (2) cropped faces with a 20% margin around the bounding box. The experiments included both gender-agnostic and gender-aware configurations, as well as models with and without fine-tuning. The results are shown in Table 3 (cropped faces without margins) and Table 4 (cropped faces with a 20% margin). As shown in Table 3, the gender-aware models with fine-tuning achieved better performance in all cases, with the best accuracy of 91.87% obtained by SWT-DD-SS. When a 20% margin was added around the face, the results improved further, as shown in Table 4, with the best accuracy reaching 93.84%. The inclusion of gender information consistently improved both F1-score and accuracy, except for DeiT on faces with margins. We attribute this exception to DeiT’s distillation bias from CNN teachers, which favors texture-heavy cues in the extra 20% margin around the face.

The experiments confirm that context from expanded margins and gender-awareness contribute to enhanced detection performance for driver drowsiness.

4.5. Evaluation of Region-Level Models

While different body regions provide some cues to detect drowsiness, eye and mouth dynamics provide the most reliable and discriminative visual indicators. In this set of experiments, we evaluated two dual-stream architectures, focusing on eye and mouth regions. The proposed architecture leverages specialized feature extraction for different facial regions. The results are shown in Table 5. The experiments were conducted using different pretrained transformer-based and CNN-based architectures, with and without fine-tuning. These results show that gender-aware architectures yield better results than their gender-agnostic counterparts across training setups, with DeiT as the exception, as indicated by the percentage of improvement in the last column. This supports the effectiveness of adding gender-specific cues in enhancing model discriminability. Additionally, fine-tuning the pretrained models led to significant performance improvements, particularly for transformer-based architectures. DeiT-2S saw a significant increase in F1-score from 88.58% to 93.26% upon fine-tuning, while ViT-2S improved from 86.68% to 94.09%. These results suggest that fine-tuning improves how global representations adjust to the specific characteristics of the dataset. It helps handle issues like changes in lighting, partial occlusions, and facial movements. In contrast, the performance improvement for the CNN-based ResNet50-2S was relatively moderate. This indicates that traditional CNN architectures may be less capable of leveraging fine-tuning compared to transformer-based models.

We can also see that, unlike the convolutional model (ResNet50), the fine-tuned transformer-based models (ViT, DeiT, SWT) perform better on eye–mouth regions than on the entire cropped face. This highlights a fundamental architectural difference between CNNs, such as ResNet50, and transformers, such as ViT, DeiT, and Swin-T. CNNs rely on receptive fields and hierarchical aggregation, which can benefit more from the global spatial context of the entire face. In contrast, transformers excel at structured localized attention to discover relevant spatial dependencies. When a transformer is fed the entire face, it must attend to small eye and mouth regions within a large image, which diffuses attention and wastes capacity on non-informative regions. When it is trained on eye and mouth regions only, it focuses more on drowsiness-relevant micro-features, leading to better results.

Figure 5 presents Grad-CAM visualizations of four categories of samples, showing how our model focuses on the eye and mouth regions for drowsiness classification. Figure 6 shows the training and validation accuracy and loss curves of the proposed model. DeiT-2S also performed well, showing its ability to work efficiently with small-scale data using token-level regularization. ViT-2S performs better after fine-tuning, but its scores remain slightly below those of SWT-DD-2S, possibly due to the lack of hierarchical feature extraction. ResNet50-2S gives decent performance but lower accuracy and F1-scores, which suggests that CNN models are less flexible in capturing fine details in the separate eye and mouth regions.

The fine-tuned gender-aware SWT-DD-3S model achieved the highest overall accuracy (95.47%) and F1-score (95.59%), exceeding the gender-agnostic variant by approximately 0.35 and 0.31 percentage points, respectively (Table 5). We observe similar performance gaps across ResNet50-2S, DeiT-2S, ViT-2S, and SWT-DD-2S.

4.6. Evaluation of the Proposed Models

The hierarchical attention structure of SWT-DD-3S likely contributes to its superior performance, as the model captures local and global dependencies in parallel. In summary, we can draw three main observations from these experiments. First, gender-aware models performed better than their gender-agnostic counterparts in almost all configurations. This supports the hypothesis that gender-specific facial dynamics provide valuable discriminative information. Figure 7 supports this finding by showing the accuracy gain of the gender-aware over the gender-agnostic architecture for 20 pairs of configurations of our model. Second, fine-tuning improves model generalization, particularly for transformer-based models. Third, the proposed gender-aware SWT-DD-3S model shows the best performance among all tested configurations. It benefits from its hierarchical attention mechanism and effective fusion of features from the eye, mouth, and face regions.

4.7. Ablation Study

We conducted a systematic ablation study, guided by Algorithm 1, varying the input size, architecture, and training parameters to find the best combination. Inputs are resized to $96 \times 96$, $128 \times 128$, $224 \times 224$, or $256 \times 256$ (multiples of 32, to be Swin-compatible). For the architecture, we vary the fusion strategy (summation (S), concatenation (C), or cross-stream attention (A)) and the classifier head (linear or MLP). The backbone Swin Transformer can be pretrained on ImageNet (Pret) or trained from scratch, backbone freezing (FreezeB) can be partial or full, and the network can use the eyes branch only (E) or both the eyes and mouth branches (E+M).
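To make the difference between the first two fusion strategies concrete, the following is a minimal sketch on toy feature vectors. The function names and the plain-list representation are illustrative assumptions, not the authors' implementation; cross-stream attention is omitted because the text does not specify its form.

```python
def sum_fusion(eye_feat, mouth_feat):
    """Element-wise sum: the fused vector keeps the per-stream dimension."""
    if len(eye_feat) != len(mouth_feat):
        raise ValueError("sum fusion needs equally sized feature vectors")
    return [e + m for e, m in zip(eye_feat, mouth_feat)]


def concat_fusion(eye_feat, mouth_feat):
    """Concatenation: the fused vector is as long as both streams together."""
    return list(eye_feat) + list(mouth_feat)


# Toy 2-dimensional features standing in for the eye and mouth streams.
eye = [1.0, 2.0]
mouth = [0.5, 0.5]
fused_sum = sum_fusion(eye, mouth)
fused_concat = concat_fusion(eye, mouth)
```

One practical consequence is that a classifier head placed after concatenation sees twice as many input features as one placed after summation, so the two options differ slightly in head size.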

Algorithm 1 Ablation study workflow.

Require: Dataset $D$; configuration search space $C$
Ensure: Performance metrics (Prc, Rec, F1, Acc, Latency)

▹ Explore different models
for $i = 1$ to 16 do
    Random configuration $C_{i}$ from $C$
    GA ← False
    Train model $M_{i} \leftarrow \mathrm{TRAINMODEL}(C_{i}, D)$
    Evaluate $M_{i}$, record metrics
    GA ← True
    Train model $M_{i + 16} \leftarrow \mathrm{TRAINMODEL}(C_{i}, D)$
    Evaluate $M_{i + 16}$, record metrics
end for

▹ Baseline Selection
Find best-performing config $C_{base}$ across 30 runs

▹ Finer optimization of baseline
for each modified parameter set $p$ near $C_{base}$ do
    Define $C_{p}$ by adjusting one/two parameters in $C_{base}$
    Train $M_{p} \leftarrow \mathrm{TRAINMODEL}(C_{p}, D)$
    Evaluate $M_{p}$
    if $M_{p} \gg M_{base}$ then
        $M_{base} \leftarrow M_{p}$
    end if
end for
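The exploration and baseline-selection phases of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the search-space keys, the function names, and the `dummy_train` stand-in (which returns a fixed fake accuracy instead of training a network) are all hypothetical and are not taken from the authors' code.

```python
import random

# Hypothetical search space built from options named in this section.
SEARCH_SPACE = {
    "img_size": [96, 128, 224, 256],
    "fusion": ["sum", "concat", "attention"],
    "head": ["linear", "mlp"],
    "pretrained": [True, False],
    "modality": ["E", "E+M"],
    "augmentation": ["light", "heavy"],
    "lr": [1e-3, 1e-4, 1e-5],
}


def sample_config(rng):
    """Draw one random configuration C_i from the search space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def run_paired_ablation(train_fn, n_configs=16, seed=0):
    """Train every sampled configuration twice: GA=False, then GA=True."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_configs):
        config = sample_config(rng)
        for gender_aware in (False, True):
            metrics = train_fn(config, gender_aware)
            runs.append(
                {"config": config, "gender_aware": gender_aware, "metrics": metrics}
            )
    return runs


def select_baseline(runs, key="acc"):
    """Pick the run with the highest value of the chosen metric."""
    return max(runs, key=lambda run: run["metrics"][key])


def dummy_train(config, gender_aware):
    """Stand-in for TRAINMODEL: returns a fake accuracy, trains nothing."""
    return {"acc": 1.0 if gender_aware else 0.0}


runs = run_paired_ablation(dummy_train)
baseline = select_baseline(runs)
```

Because both members of a pair share the same sampled configuration, any difference in their metrics can be attributed to the gender-aware switch alone, which is what the matched-pair comparisons later in this section rely on.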

We also varied the training parameters: augmentation strength (light or heavy), optimizer (AdamW or SGD), and LR schedule (ReduceLROnPlateau or cosine with warm-up), where LR = $10^{-n}$ and $n \in \{3, 4, 5\}$. We explored 16 randomly chosen combinations and evaluated each in gender-aware and gender-agnostic form. To assess feasibility for real-time deployment, we also report computational indicators: parameter count, serialized model size (MB), and inference latency/throughput measured on batched random inputs. As per the ablation results in Table 6, the proposed model performs well in most of the 32 runs. The model has a moderate number of parameters (approx. 55 M), a small memory footprint (approx. 210–215 MB), and low latency per image at $224 \times 224$ input (approx. 1.75 ms); these characteristics make it suitable for deployment on edge and in-vehicle devices with constrained resources.
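The computational indicators above can be approximated with a few lines of code. The sketch below is illustrative only: it assumes weights are stored as 32-bit floats and that size is counted in units of 1024 × 1024 bytes, neither of which is stated in the text, and `predict` is a hypothetical callable standing in for the model's batched forward pass.

```python
import time


def float32_size_mib(num_params):
    """Storage for the weights alone, assuming 4 bytes per parameter."""
    return num_params * 4 / (1024 ** 2)


def latency_ms_per_image(predict, batch, warmup=3, iters=10):
    """Average wall-clock time per image for a batched callable."""
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        predict(batch)
    start = time.perf_counter()
    for _ in range(iters):
        predict(batch)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (iters * len(batch))


# 55.04 M parameters, the count reported for most rows of Table 6.
estimated_size = float32_size_mib(55.04e6)

# Toy stand-in for a model: sums each "image" in the batch.
toy_batch = [[1.0] * 8 for _ in range(4)]
toy_latency = latency_ms_per_image(lambda b: [sum(x) for x in b], toy_batch)
```

Under these assumptions the estimate for 55.04 M parameters is just under 210, close to the serialized sizes listed in Table 6; the small remainder would be file-format overhead. When timing a model on a GPU, the device usually has to be synchronized before reading the clock, otherwise asynchronous execution makes the measured latency too small.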

The best configuration is found when the model is gender-aware with no pretraining; runs on both eyes and mouth branches; and uses sum fusion, an MLP-256 head, 0.05 dropout, lr = 0.00001, and light augmentation. With this configuration, the fused two-stream model (SWT-DD-2S in Table 5) reaches 95.42% precision, 94.70% recall, 95.06% F1-score and 94.93% accuracy. Enabling gender awareness yields small but mostly consistent gains when other settings are kept constant. Matched pairs indicate a +0.23–0.53% accuracy improvement for the gender-aware architecture (e.g., s8/s18: +0.23% Acc; s11/s27: +0.53% Acc; s10/s20: +0.31% Acc). One linear head pair slightly favors the gender-agnostic model (s9 vs. s19: −0.28% Acc), but the median gain across matched pairs remains +0.24% in favor of the gender-aware design.

Overall, incorporating gender awareness is beneficial with minimal downside risk. As for fusion and head design, 'sum' fusion outperforms 'concat' fusion on average, with higher average accuracy in both gender-agnostic (94.23 vs. 93.50) and gender-aware models (94.30 vs. 94.10). By contrast, 'attention' fusion combined with heavy augmentation proved unstable, performed poorly (F1 66.39/Acc 72.89), and was therefore discarded. With sum fusion, an MLP head outperforms a linear head (s18 vs. s19: +0.63 Acc). The best trade-off is achieved with MLP-256/512 and modest dropout (0.05–0.1); a larger dropout (0.2) mildly hurts performance (s24). On input resolution, increasing the image size raised latency by about 68% (1.75 ms to 2.94 ms) without performance gains and slightly reduced accuracy across both gender-agnostic and gender-aware models, which indicates that $224 \times 224$ offers the best accuracy–latency trade-off. Pretrained models performed better at the default learning rate of $1 \times 10^{-4}$. Freezing the network backbone reduced accuracy by 0.5 to 1.1 percentage points, which indicates that full fine-tuning is critical. Finally, the proposed model is lightweight, and the gender-aware component adds only a small amount of extra computation. Its simple design keeps memory usage and latency low. With its size and speed, the model can run in real time on embedded automotive systems, especially when combined with common optimization methods such as pruning, quantization, and hardware acceleration.

5. Gender-Based Fairness Evaluation

We also evaluated the gender-based fairness of the drowsiness detection model (SWT-DD-3S) using different metrics, as shown in Table 7. The fairness analysis in this study is restricted to a binary gender setting, as provided in the NTHU-DDD dataset. The reported metrics are the per-gender True Positive Rate ($\mathrm{TPR}_{g}$), where $g \in \{m, f\}$ denotes the male or female group; the Equal Opportunity Difference (EOD); the per-gender Positive Prediction Rate ($\mathrm{PPR}_{g}$); the Demographic Parity Difference (DPD); the Disparate Impact Ratio (DIR); and the per-gender Brier Score. These metrics are computed following Equations (6)–(11). Here, $G_{m}$ and $G_{f}$ denote the male and female subsets, respectively. For each sample $i$, $y_{i} \in \{0, 1\}$ is the ground-truth label (1 = drowsy, 0 = alert), and $\hat{y}_{i}$ is the predicted label. The predicted probability of the drowsy class is denoted as $\hat{p}_{i}$, and the number of samples in gender group $g$ is represented by $N_{g}$. Smaller values of EOD and DPD indicate lower bias, while a DIR value close to 1 and similar Brier scores across genders indicate comparably calibrated, fair predictions.

$$\mathrm{TPR}_{g} = \frac{\mathrm{TP}_{g}}{\mathrm{TP}_{g} + \mathrm{FN}_{g}}, \quad g \in \{m, f\} \tag{6}$$

$$\mathrm{EOD} = |\mathrm{TPR}_{m} - \mathrm{TPR}_{f}| \tag{7}$$

$$\mathrm{PPR}_{g} = P(\hat{Y} = 1 \mid g) = \frac{\mathrm{TP}_{g} + \mathrm{FP}_{g}}{N_{g}} \tag{8}$$

$$\mathrm{DPD} = |\mathrm{PPR}_{m} - \mathrm{PPR}_{f}| \tag{9}$$

$$\mathrm{DIR} = \frac{\mathrm{PPR}_{f}}{\mathrm{PPR}_{m}} \tag{10}$$

$$\mathrm{Brier\ Score}_{g} = \frac{1}{N_{g}} \sum_{i \in G_{g}} (\hat{p}_{i} - y_{i})^{2}, \quad g \in \{m, f\} \tag{11}$$
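Equations (6)–(11) translate directly into code. The following is a minimal sketch on a toy sample of eight predictions; the function name, the list-based inputs, and the toy values are illustrative assumptions and do not reproduce the data behind Table 7. It also assumes every group contains at least one drowsy sample and that the male group receives at least one positive prediction, since the ratios are undefined otherwise.

```python
def group_fairness(y_true, y_pred, p_hat, gender):
    """Per-group TPR, PPR, and Brier score, plus EOD, DPD, and DIR."""
    stats = {}
    for g in ("m", "f"):
        idx = [i for i, gi in enumerate(gender) if gi == g]
        n = len(idx)
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        stats[g] = {
            "tpr": tp / (tp + fn),                                      # Eq. (6)
            "ppr": (tp + fp) / n,                                       # Eq. (8)
            "brier": sum((p_hat[i] - y_true[i]) ** 2 for i in idx) / n,  # Eq. (11)
        }
    gaps = {
        "eod": abs(stats["m"]["tpr"] - stats["f"]["tpr"]),  # Eq. (7)
        "dpd": abs(stats["m"]["ppr"] - stats["f"]["ppr"]),  # Eq. (9)
        "dir": stats["f"]["ppr"] / stats["m"]["ppr"],       # Eq. (10)
    }
    return stats, gaps


# Toy data: four male and four female samples (1 = drowsy, 0 = alert).
gender = ["m", "m", "m", "m", "f", "f", "f", "f"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
p_hat = [0.9, 0.4, 0.2, 0.1, 0.8, 0.7, 0.6, 0.3]

stats, gaps = group_fairness(y_true, y_pred, p_hat, gender)
```

In this deliberately unbalanced toy example the classifier detects all drowsy female samples but only half of the drowsy male samples, so EOD and DPD are large and DIR is far from 1; a fair model would show the opposite pattern.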

Comparisons with Existing Models and Limitations

Table 8 summarizes contemporary methods evaluated on the NTHU-DDD dataset for drowsiness detection. Early approaches such as [42] relied on multi-stream CNNs over raw video frames and achieved an accuracy of 73.06%. Subsequent works introduced temporal modeling to capture motion dynamics; for example, MSTN [43] and ConvGRNN [44] improved performance to around 82–85%. Recurrent and fusion-based strategies [45,46] raised accuracy above 90%. More recent designs include graph-based attention (e.g., MSTAGCN [22]), which achieved 92.4% by leveraging facial landmark dynamics. In comparison, our proposed three-stream Swin Transformer model achieved 95.47% accuracy, the highest among the compared methods. This indicates that robust spatial feature extraction and cross-region fusion (face, eyes, and mouth) can effectively capture drowsiness-related cues, and it highlights the efficacy of transformer-based architectures for visual drowsiness detection.

However, the results depend on the characteristics of the benchmark dataset. The primary concerns about the dataset are the relatively small number of participants (36 subjects) and the limited diversity of driving conditions. Despite these limitations, the proposed methodology is general and can be evaluated on other datasets as they become available. We also considered testing our model on other datasets such as YawDD and UTA-RLDD. However, these datasets do not provide reliable frame-level drowsiness labels, and their gender distributions are highly unbalanced, making them less suitable for evaluating a gender-aware approach. In future work, we plan to use YawDD and UTA-RLDD after applying preprocessing and labeling methods.

6. Conclusions

This paper explored a three-stream gender-aware Transformer architecture with shifted windows for driver drowsiness detection. By explicitly integrating gender embeddings to modulate the region-based features, the model emphasizes gender-specific indicators of drowsiness. Extensive evaluations on the NTHU-DDD dataset showed that incorporating gender information can improve detection accuracy, which reached 95.47%. Moreover, the proposed method consistently outperformed DeiT, ViT, and ResNet50, which supports the model's robustness and effectiveness in both gender-aware and gender-agnostic contexts. These findings underscore the importance of demographic considerations in drowsiness detection systems and can help in designing more personalized and accurate automated driver safety solutions. Future work should extend the model to additional demographic factors or modalities, such as age and ethnicity, to better capture and personalize drowsiness-related facial patterns. Furthermore, analyzing driver behavior using more fine-grained action classes may further improve the practical applicability and robustness of driver alertness monitoring frameworks.

Author Contributions

Conceptualization, E.-S.M.E.-A.; methodology, M.F.N. and E.-S.M.E.-A.; software, M.F.N.; validation, M.F.N. and E.-S.M.E.-A.; formal analysis, M.F.N. and E.-S.M.E.-A.; investigation, M.F.N. and E.-S.M.E.-A.; data curation, M.F.N.; writing—original draft preparation, M.F.N.; writing—review and editing, M.F.N. and E.-S.M.E.-A.; visualization, M.F.N.; supervision, E.-S.M.E.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by KFUPM under grant number INSS2204, and the APC was also supported by KFUPM.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is publicly available as cited in the paper and more information can be found at http://cv.cs.nthu.edu.tw/php/callforpaper/datasets/DDD/.

Acknowledgments

The authors gratefully acknowledge support from King Fahd University of Petroleum and Minerals through the Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. National Highway Traffic Safety Administration. Drowsy driving. Available online: https://www.nhtsa.gov/risky-driving/drowsy-driving (accessed on 7 March 2026).
2. Hashemi, M.; Mirrashid, A.; Beheshti Shirazi, A. Driver safety development: Real-time driver drowsiness detection system based on convolutional neural network. SN Comput. Sci. 2020, 1, 289.
3. Dua, M.; Shakshi; Singla, R.; Raj, S.; Jangra, A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput. Appl. 2021, 33, 3155–3168.
4. Albadawi, Y.; AlRedhaei, A.; Takruri, M. Real-time machine learning-based driver drowsiness detection using visual features. J. Imaging 2023, 9, 91.
5. Jahan, I.; Uddin, K.A.; Murad, S.A.; Miah, M.S.U.; Khan, T.Z.; Masud, M.; Aljahdali, S.; Bairagi, A.K. 4D: A real-time driver drowsiness detector using deep learning. Electronics 2023, 12, 235.
6. Salem, D.; Waleed, M. Drowsiness detection in real-time via convolutional neural networks and transfer learning. J. Eng. Appl. Sci. 2024, 71, 122.
7. Jarndal, A.; Tawfik, H.; Siam, A.I.; Alsyouf, I.; Cheaitou, A. A real-time vision transformers-based system for enhanced driver drowsiness detection and vehicle safety. IEEE Access 2024, 13, 1790–1803.
8. Zhang, Z.; Ning, H.; Zhou, F. A systematic survey of driving fatigue monitoring. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19999–20020.
9. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
10. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-shot multi-level face localisation in the wild. IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020; pp. 5203–5212.
11. Guo, X.; Li, S.; Yu, J.; Zhang, J.; Ma, J.; Ma, L.; Liu, W.; Ling, H. PFLD: A practical facial landmark detector. arXiv 2019, arXiv:1902.10859.
12. Xiao, W.; Liu, H.; Ma, Z.; Chen, W.; Sun, C.; Shi, B. Fatigue driving recognition method based on multi-scale facial landmark detector. Electronics 2022, 11, 4103.
13. Makhmudov, F.; Turimov, D.; Xamidov, M.; Nazarov, F.; Cho, Y.I. Real-time fatigue detection algorithms using machine learning for yawning and eye state. Sensors 2024, 24, 7810.
14. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
17. Stancin, I.; Zeba, M.Z.; Friganovic, K.; Cifrek, M.; Jovic, A. Information on drivers’ sex improves EEG-based drowsiness detection model. Appl. Sci. 2022, 12, 8146.
18. Clavell, G.G.; González-Sendino, R.; Vazquez, P. Demographic benchmarking: Bridging socio-technical gaps in bias detection. arXiv 2025, arXiv:2501.15985.
19. Deng, W.; Wu, R. Real-time driver-drowsiness detection system using facial features. IEEE Access 2019, 7, 118727–118738.
20. Guo, J.M.; Markoni, H. Driver drowsiness detection using hybrid convolutional neural network and long short-term memory. Multimed. Tools Appl. 2019, 78, 29059–29087.
21. Maior, C.B.S.; das Chagas Moura, M.J.; Santana, J.M.M.; Lins, I.D. Real-time classification for autonomous drowsiness detection using eye aspect ratio. Expert Syst. Appl. 2020, 158, 113505.
22. Fa, S.; Yang, X.; Han, S.; Feng, Z.; Chen, Y. Multi-scale spatial–temporal attention graph convolutional networks for driver fatigue detection. J. Vis. Commun. Image Represent. 2023, 93, 103826.
23. Xiao, W.; Liu, H.; Ma, Z.; Chen, W.; Hou, J. FPIRST: Fatigue driving recognition method based on feature parameter images and a residual Swin transformer. Sensors 2024, 24, 636.
24. Mate, P.; Patil, A.; Talhar, M.; Khade, A. Detection of driver drowsiness using transfer learning techniques. Multimed. Tools Appl. 2024, 83, 35237–35255.
25. Essahraui, S.; Lamaakal, I.; El Hamly, I.; Maleh, Y.; Ouahbi, I.; El Makkaoui, K.; Filali Bouami, M.; Pławiak, P.; Alfarraj, O.; Abd El-Latif, A.A. Real-time driver drowsiness detection using facial analysis and machine learning techniques. Sensors 2025, 25, 812.
26. Hassan, O.F.; Ibrahim, A.F.; Gomaa, A.; Makhlouf, M.; Hafiz, B. Real-time driver drowsiness detection using transformer architectures: A novel deep learning approach. Sci. Rep. 2025, 15, 17493.
27. Abd El-Nabi, S.; Ibrahim, A.F.; El-Rabaie, E.S.M.; Hassan, O.F.; Soliman, N.F.; Ramadan, K.F.; El-Shafai, W. Driver drowsiness detection using swin transformer and diffusion models for robust image denoising. IEEE Access 2025, 13, 71880–71907.
28. Rahmani, C.; Benlamoudi, A.; Bounab, Y.; Bekhouche, S.E.; Samai, D.; Dornaika, F.; Taleb, A.; Belhaouari, S.B. A Semi-supervised neural framework for real-time drowsiness detection using facial cues. IEEE Access 2026, 14, 12816–12836.
29. Thampi, L.L.; Neethu, C.T.; Reddy, A.K.; Khan, I.A.; Aswathy, M.A.; Kumar, A.; Kumar, S. Smart driver assistance: Real-time drowsiness detection leveraging facial cues with MediaPipe and OpenCV. Int. J. Intell. Transp. Syst. Res. 2026.
30. Bhanja, A.; Parhi, D.; Gajendra, D.; Sinha, K.; Sahoo, A.K. Driver drowsiness shield (DDSH): A real-time driver drowsiness detection system. Robomech J. 2025, 12, 1–11.
31. Abo-Zahhad, M.M.; Elghamrawy, S.; Hefny, A.A.; Elawady, Y.H. Early drowsiness detection model in autonomous vehicles using GAN and YOLO integration. Neural Comput. Appl. 2025, 37, 28353–28378.
32. Lin, L.; Wang, S.; Yang, J.; Wei, F. A multi-aware graph convolutional network for driver drowsiness detection. Knowl.-Based Syst. 2024, 305, 112643.
33. Gao, Z.; Duan, P.; Li, R.; Tong, Z. A hybrid GCN-LSTM model for driver drowsiness detection. In SPIE—Fourth International Conference on Signal Processing and Computer Science (SPCS 2023); Nayyar, A., Kolivand, H., Eds.; SPIE: Washington, DC, USA, 2023.
34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; Curran Associates Inc.: Red Hook, NY, USA, 2017.
35. Azmi, M.M.B.M.; Zaman, F.H.K. Driver drowsiness detection using vision transformer. In 2024 IEEE 14th Symposium on Computer Applications & Industrial Electronics (ISCAIE); IEEE: Piscataway, NJ, USA, 2024; pp. 329–336.
36. Phan, T.-C.; Phan, A.-C.; Nguyen, N.-H. A novel approach of drowsiness levels detection using Vis-Net combined with facial emotion. Syst. Soft Comput. 2025, 7, 200288.
37. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning; PMLR: London, UK, 2021; pp. 10347–10357.
38. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 558–567.
39. Khan, S.S.; Sengupta, D.; Ghosh, A.; Chaudhuri, A. MTCNN++: A CNN-based face detection algorithm inspired by MTCNN. Vis. Comput. 2024, 40, 899–917.
40. Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2018; Volume 32.
41. Weng, C.H.; Lai, Y.H.; Lai, S.H. Driver drowsiness detection via a hierarchical temporal deep belief network. In Computer Vision–ACCV 2016 Workshops; Revised Selected Papers, Part III 13; Springer: Cham, Switzerland, 2017; pp. 117–133.
42. Park, S.; Pan, H.; Kang, S.; Yoo, C. Driver drowsiness detection system based on feature representation learning using various deep networks. In ACCV 2016 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 154–164.
43. Shih, J.; Hsu, Y. MSTN: Multistage spatial–temporal network for driver drowsiness detection. In ACCV 2016 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 146–153.
44. Dang, T.; Hoang, H.; Do, T.; Pham, V. A deep neural network for real-time driver drowsiness detection. IEICE Trans. Inf. Syst. 2019, 102, 1374–1383.
45. Lyu, S.; Yuan, J.; Chen, Y. Long-term multi-granularity deep framework for driver drowsiness detection. arXiv 2018, arXiv:1801.02325.
46. Shen, J.; Wang, X.; Song, Y. Robust two-stream multi-feature network for driver drowsiness detection. arXiv 2020, arXiv:2010.06235.
47. Yu, J.; Park, S.; Lee, S.; Jeon, M. Driver drowsiness detection using condition-adaptive representation learning framework. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4206–4218.
48. Tüfekci, G.; Kayabaşı, A.; Akagündüz, E.; Ulusoy, I. Detecting driver drowsiness as an anomaly using LSTM autoencoders. In ECCV 2022 Workshops; Springer: Cham, Switzerland, 2023.

Figure 1. Overall system architecture and workflow (the face image is an example from the NTHU-DDD dataset and is reproduced with permission).

Figure 2. Proposed gender-aware multi-stream shifted window-based transformer model (the drowsy face image is an example from the NTHU-DDD dataset and is reproduced with permission).

Figure 3. Cropping drowsy face regions using RetinaFace (the drowsy face image is an example from the NTHU-DDD dataset and is reproduced with permission).

Figure 4. NTHU-DDD sample images (reproduced with permission).

Figure 5. Grad-CAM visualizations on 4 cases illustrating model attention to eyes and mouth regions: (a) Male Non-Drowsy. (b) Male Drowsy. (c) Female Non-Drowsy. (d) Female Drowsy.

Figure 6. Curves on train/val history of four gender-agnostic and gender-aware models: (a) Accuracy curves. (b) Loss curves.

Figure 7. Accuracy gain of gender-aware (GA) over gender-agnostic (non-GA) architecture for 20 pairs of configurations.

Table 1. Summary of the analyzed related works.

Ref	Year	Model(s)	Accuracy	Strengths and Weaknesses	Dataset(s)
[19]	2019	DriCare: Multiple CNN-kernelized correlation filters with MTCNN	92% (Avg.)	High accuracy under various conditions but performance decreases in low light or when driver wears glasses.	CelebA, YawDD
[20]	2019	Hybrid CNN-LSTM	84.85%	Effectively captures spatiotemporal features; however, it relies heavily on handcrafted data preprocessing and manually defined skip intervals.	ACCV Drowsy
[21]	2020	EAR+SVM	94.44%	Real-time with a short temporal window; however, the evidence base is small and narrow, only based on eye pattern, manual temporal processing via a vector.	DROZY
[2]	2020	FD-NN, TL-VGG16, TL-VGG19	FD-NN: 98.15%, TL-VGG16: 95.45%, TL-VGG19: 95%	Effective in capturing fine-grained features for fatigue but depend on pretrained models and dataset is limited.	Self-prepared ZJU dataset
[3]	2021	Deep-CNN-based ensemble	85%	Considers different types of features but low accuracy.	NTHU-DDD
[5]	2023	VGG16, VGG19, and 4D	VGG16: 95.93%, VGG19: 95.03%, 4D: 97.53%	Consistent results and good temporal analysis but may struggle under dynamic conditions as the dataset is clean and oversimplified.	MRL
[22]	2023	MSSTAGCN built on facial landmark graphs	92.4%	Resilient to lighting changes, occlusions, and skin-tone differences, but relies on accurate OpenPose landmarks and errors can propagate.	NTHU-DDD
[23]	2024	FPIRST: Residual Swin Transformer	96.40%	Captures fine-grained temporal facial features (eyes, mouth) but performance drops in complex scenarios.	HNUFD
[6]	2024	CNN, InceptionV3, MobileNetV2	CNN: 96%, MobileNetV2: 97%, InceptionV3: 98%	Responsive model for real-time but the dataset is oversimplified including only open/closed eyes.	MRL
[13]	2024	VGG16 and CNN	VGG16: 95.85%, CNN: 96.45%	Combines Haar cascades and CNN feature extraction but may need more power in high-complexity tasks and tested on limited dataset.	YawDD, MRL
[24]	2024	VGG19	96.51%	Performs well in different lighting and environmental scenarios but lacks flexibility due to static network structure.	NTHU-DDD
[25]	2025	KNN, SVM, CNN, YOLOv5, YOLOv8, Faster R-CNN	KNN: 98.89% (UTA-RLDD); CNN: 99.97%; YOLOv5/YOLOv8: 99.5%	Real-time, achieved near-perfect accuracy using YOLO, but low performance on YawDD due to lighting/yawning variability, requires significant computational resources.	UTA-RLDD, NTHU-DDD, YawDD
[7]	2025	ViT-DDD	98.89%, 99.4%	Implemented prototype of the model, real-time ViT pipeline that uses full face but used a subset of the data without applying a standard train–test split.	NTHU-DDD, UTA-RLDD
[26]	2025	Swin-T	MRL: 99.03%, NTHU-DDD: 98.76%, CEW: 100%	Explainability via CAM focusing on eye regions but only focused on eyes and evaluation mostly offline.	MRL Eye, CEW, NTHU-DDD
[27]	2025	Swin-T	Eye Blink: 99.82%, CEW: 99.94%	Better denoising capability; however, diffusion is computationally costly.	Eye Blink, CEW
[28]	2026	YOLOv8 + Swin-T	UTA-RLDD: 99.99%; YawDD: 99.34%; NTHU-DDD: 95.94%	Semi-supervised learning reduces labeling dependence; however, pseudo-label noise and face-detection failures can still propagate errors.	NTHU-DDD, YawDD, UTA-RLDD
[29]	2026	CNN + Dlib/MediaPipe + OpenCV	MRL Eye: 84.53%; YawDD: 96.42%	Compares learned CNN cues with classical landmark-based tracking, but CNN accuracy on MRL is moderate.	MRL Eye, YawDD

Table 2. Confusion matrix with performance metrics (T—true, F—false, N—negative, P—positive).

Actual \ Predicted	Alert (Negative)	Drowsy (Positive)
Alert	TN	FP
Drowsy	FN	TP
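The four metrics reported in the result tables follow from these confusion-matrix counts using the standard definitions. The sketch below is illustrative: the function names and the toy counts are assumptions, while the precision and recall passed to `f1_from` are the values listed for the gender-aware SWT-DD-3S model in Table 5.

```python
def classification_metrics(tn, fp, fn, tp):
    """Precision, recall, F1, and accuracy with drowsy as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"prc": precision, "rec": recall, "f1": f1, "acc": accuracy}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Toy counts, chosen only to exercise the formulas.
toy = classification_metrics(tn=40, fp=10, fn=5, tp=45)

# Consistency check against Table 5: Prc 95.98 and Rec 95.21 (in percent).
f1_3s = round(f1_from(95.98, 95.21), 2)
```

Recomputing F1 from the tabulated precision and recall in this way gives 95.59, which agrees with the F1-score reported for that model.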

Table 3. Evaluation of different gender-agnostic and gender-aware single-stream (SS) models on cropped faces (no margins); bold text means the overall best value.

Model	Finetuned?	Agnostic Prc	Agnostic Rec	Agnostic F1	Agnostic Acc	Aware Prc	Aware Rec	Aware F1	Aware Acc
ResNet50-SS	N	90.96	88.02	89.47	89.29	89.25	90.59	89.92	89.50
ResNet50-SS	Y	91.67	89.94	90.82	90.60	93.82	89.24	91.47	91.40
ViT-SS	N	90.40	84.41	87.30	87.30	88.08	91.86	89.93	89.36
ViT-SS	Y	91.71	90.41	91.05	90.81	92.92	89.93	91.40	91.25
DeiT-SS	N	83.71	81.16	82.42	81.95	84.74	85.79	85.26	84.67
DeiT-SS	Y	92.77	89.63	91.12	90.98	94.08	88.92	91.43	91.38
SWT-DD-SS	N	91.60	87.92	89.82	89.70	91.37	90.35	90.86	90.61
SWT-DD-SS	Y	93.38	89.93	91.62	91.50	94.87	89.08	91.89	91.87

Table 4. Evaluation of different gender-agnostic and gender-aware single-stream (SS) models on cropped faces with 20% margin; bold text means the overall best value.

Model	Finetuned?	Agnostic Prc	Agnostic Rec	Agnostic F1	Agnostic Acc	Aware Prc	Aware Rec	Aware F1	Aware Acc
ResNet50-SS	N	92.60	87.71	90.09	90.03	90.39	90.20	90.65	90.39
ResNet50-SS	Y	95.91	86.47	90.94	91.10	93.45	90.56	91.98	91.84
ViT-SS	N	89.97	88.56	89.26	88.99	88.08	91.86	89.93	89.36
ViT-SS	Y	91.58	89.45	90.50	90.11	92.74	90.90	91.81	91.62
DeiT-SS	N	82.94	90.25	86.44	85.08	85.68	82.73	84.18	83.93
DeiT-SS	Y	92.06	91.48	91.77	91.52	93.33	89.09	91.16	91.07
SWT-DD-SS	N	92.79	85.52	89.01	89.08	94.02	91.19	92.58	92.45
SWT-DD-SS	Y	95.09	88.88	91.88	91.88	93.44	94.74	94.08	93.84

Table 5. Evaluation of different gender-agnostic and gender-aware multi-stream (2S and 3S) models on cropped eye–mouth regions; bold means the results are better compared to gender-agnostic, and bold italics means the best overall; where PAI is the percentage accuracy improvement.

Model	Finetuned?	Agnostic Prc	Agnostic Rec	Agnostic F1	Agnostic Acc	Aware Prc	Aware Rec	Aware F1	Aware Acc	PAI
ResNet50-2S	N	82.61	90.56	86.40	85.28	89.66	83.87	86.67	86.67	+1.39
ResNet50-2S	Y	90.38	84.42	87.30	87.31	91.29	87.21	89.20	89.10	+1.79
ViT-2S	N	84.95	85.99	85.47	84.93	85.83	87.54	86.68	86.13	+1.2
ViT-2S	Y	93.01	94.34	93.67	93.43	93.13	95.07	94.09	93.85	+0.42
DeiT-2S	N	87.68	87.30	87.49	87.14	89.33	87.84	88.58	88.33	+0.79
DeiT-2S	Y	92.22	92.85	92.53	92.28	94.47	92.07	93.26	93.14	−0.45
SWT-DD-2S	N	89.41	88.77	89.09	88.80	88.97	91.08	90.01	89.59	+0.79
SWT-DD-2S	Y	93.13	95.07	94.09	93.75	95.42	94.70	95.06	94.93	+1.18
SWT-DD-3S	Y	95.24	95.32	95.28	95.12	95.98	95.21	95.59	95.47	+0.35
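For the SWT-DD rows above, the PAI column matches a simple difference between the gender-aware and gender-agnostic accuracies, expressed in percentage points. The short sketch below reproduces that arithmetic for the two fine-tuned SWT-DD models; the function name is hypothetical, and reading PAI as a plain difference is an inference from these rows rather than a definition given in the text.

```python
def accuracy_gain(acc_agnostic, acc_aware):
    """Gender-aware minus gender-agnostic accuracy, in percentage points."""
    return round(acc_aware - acc_agnostic, 2)


# (gender-agnostic Acc, gender-aware Acc) taken from Table 5.
table5_rows = {
    "SWT-DD-2S (fine-tuned)": (93.75, 94.93),
    "SWT-DD-3S (fine-tuned)": (95.12, 95.47),
}

gains = {name: accuracy_gain(a, b) for name, (a, b) in table5_rows.items()}
```

The resulting gains are 1.18 and 0.35 points, which agree with the PAI entries of +1.18 and +0.35 for these two rows.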

Table 6. Impact of image size (IMG), learning rate (LR), gender awareness (GA), pretraining (Pret), backbone freezing (Freeze.B), modality (Modal), fusion method (Fusion), head type (Head), hidden units (H.Hid), dropout (H.Drop), and augmentation (Aug) on the performance of the two-stream model in terms of number of parameters in millions (#P) and model size in megabyte (Size), latency in ms (L), precision (Prc), recall (Rec), F1, and accuracy (Acc); bold text means the overall best value.

Ref	IMG	LR	GA	Pret	Freeze.B	Modal	Fusion	Head	H.Hid	H.Drop	Aug	#P	Size	L	Prc	Rec	F1	Acc
s41	96	0.0001	True	True	False	E+M	S	linear	0	0.0	light	55.04	210.56	0.47	96.2	92.56	94.34	94.28
s42	96	0.001	True	True	False	E+M	S	linear	0	0.0	light	55.04	210.56	0.47	51.51	100.0	68.0	51.51
s44	96	0.00001	True	False	False	E	S	linear	0	0.0	light	55.04	210.56	0.24	90.86	90.21	90.54	90.28
s45	96	0.00001	False	True	False	E+M	A	mlp	384	0.25	light	57.7	220.68	0.48	94.1	94.76	94.43	94.24
s46	96	0.00001	False	True	False	E+M	S	mlp	256	0.25	light	55.24	211.29	0.47	74.59	58.19	65.38	68.25
s49	96	0.001	False	True	False	E+M	S	linear	0	0.0	light	55.04	210.54	0.47	51.51	100.0	68.0	51.51
s50	96	0.00001	False	False	False	E+M	S	linear	0	0.0	light	55.04	210.54	0.47	93.49	91.45	92.46	92.31
s51	96	0.00001	False	False	False	E	S	linear	0	0.0	light	55.04	210.54	0.24	92.05	89.32	90.67	90.53
s52	128	0.00001	True	True	False	E+M	S	linear	0	0.0	light	55.04	210.56	0.96	94.87	94.48	94.67	94.52
s55	128	0.00001	True	False	False	E+M	S	linear	0	0.0	light	55.04	210.56	0.96	91.34	93.62	92.47	92.14
s56	128	0.00001	True	False	False	E	S	linear	0	0.0	light	55.04	210.56	0.48	93.41	90.41	91.89	91.77
s57	160	0.00001	True	True	False	E+M	S	linear	0	0.0	light	55.04	210.56	1.17	95.82	93.66	94.73	94.63
s58	160	0.0001	True	True	False	E+M	S	linear	0	0.0	light	55.04	210.56	1.17	93.7	94.6	94.15	93.94
s61	160	0.00001	True	False	False	E	S	linear	0	0.0	light	55.04	210.56	0.58	92.48	92.3	92.39	92.17
s62	160	0.00001	True	False	False	E	S	linear	0	0.0	light	55.04	210.56	0.58	92.53	88.84	90.65	90.56
s1	224	0.0001	False	True	False	E+M	C	linear	0	0.0	light	55.04	210.55	1.75	95.16	93.03	94.08	93.97
s3	224	0.0001	False	True	False	E+M	A	mlp	384	0.2	heavy	57.7	220.68	1.75	91.88	51.97	66.39	72.89
s4	224	0.0001	False	False	False	E+M	C	linear	0	0.0	light	55.04	210.55	1.75	91.42	91.11	91.27	91.02
s6	224	0.0001	False	True	False	E	S	linear	0	0.0	light	55.04	210.54	0.87	95.02	91.13	93.04	92.97
s8	224	0.0001	False	True	False	E+M	S	mlp	512	0.2	light	55.43	212.04	1.75	94.96	93.97	94.46	94.32
s15	224	0.0001	False	True	True	E+M	C	mlp	256	0.1	light	55.44	212.05	1.75	92.93	93.34	93.14	92.91
s16	224	0.00001	False	True	False	E+M	S	mlp	256	0.05	light	55.24	211.29	1.75	95.33	93.81	94.57	94.45
s12	224	0.00001	False	True	False	E	S	linear	0	0.0	light	55.04	210.54	0.87	94.83	93.24	94.03	93.9
s18	224	0.0001	True	True	False	E+M	S	mlp	512	0.1	light	55.83	213.56	1.75	94.88	94.51	94.7	94.55
s20	224	0.0001	True	True	False	E+M	C	mlp	256	0.05	light	55.83	213.56	1.75	94.87	93.76	94.31	94.17
s24	224	0.0001	True	True	False	E+M	S	mlp	512	0.2	light	55.83	213.56	1.75	93.38	95.32	94.34	94.11
s26	224	0.0001	True	True	False	E+M	C	mlp	256	0.05	light	55.83	213.56	1.75	95.56	92.43	93.97	93.89
s27	224	0.0001	True	True	False	E+M	C	mlp	384	0.1	light	56.23	215.07	1.75	94.67	94.53	94.6	94.44
s32	224	0.00001	True	True	False	E+M	S	mlp	256	0.05	light	55.44	212.05	1.75	95.42	94.7	95.06	94.93
s36	224	0.00001	True	False	False	E+M	S	linear	0	0.0	heavy	55.04	210.56	1.75	77.45	53.26	63.12	67.94
s38	224	0.00001	True	True	False	E+M	A	mlp	384	0.25	light	58.0	221.82	1.75	93.01	95.61	94.29	94.03
s39	224	0.00001	True	True	False	E+M	S	mlp	256	0.25	light	55.44	212.05	1.75	75.9	64.87	69.95	71.29
s23	224	0.00001	True	False	False	E+M	C	linear	0	0.0	light	55.05	210.58	1.75	94.06	93.28	93.67	93.5
s28	224	0.00001	True	True	False	E	S	linear	0	0.0	light	55.04	210.56	0.87	94.16	94.42	94.29	94.11
s13	256	0.0001	False	True	False	E+M	C	linear	0	0.0	light	55.04	210.55	2.93	93.65	93.15	93.4	93.21
s14	256	0.0001	False	True	False	E+M	S	mlp	384	0.1	light	55.34	211.67	2.93	94.48	93.5	93.99	93.84
s22	256	0.0001	True	True	False	E+M	C	linear	0	0.0	light	55.05	210.58	2.94	93.99	93.81	93.9	93.72
s29	256	0.0001	True	True	False	E+M	C	linear	0	0.0	light	55.05	210.58	2.94	92.84	95.25	94.03	93.76
s66	224	0.00001	True	True	False	E+M+F	S	mlp	224	0.05	light	57.44	217.08	3.57	95.98	95.21	95.59	95.47
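
The F1 column in Table 6 can be cross-checked from the precision and recall columns using the standard harmonic-mean definition. The sketch below does this for three rows and flags a degenerate pattern: runs s42 and s49 (both with LR 0.001) report Rec = 100.0 and Prc = Acc = 51.51, which is what a classifier produces when it assigns every sample to the positive class, so that precision and accuracy both equal the share of positive samples. The function names and the tolerance are illustrative assumptions, not part of the original work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def predicts_all_positive(precision: float, recall: float, accuracy: float,
                          tol: float = 0.01) -> bool:
    """Heuristic for percentages: recall of 100 with precision equal to accuracy
    (and below 100) only occurs when every sample is labelled positive."""
    return (
        abs(recall - 100.0) <= tol
        and abs(precision - accuracy) <= tol
        and precision < 100.0
    )


# (Prc, Rec, F1, Acc) copied from Table 6.
RUNS = {
    "s66": (95.98, 95.21, 95.59, 95.47),
    "s32": (95.42, 94.70, 95.06, 94.93),
    "s42": (51.51, 100.0, 68.0, 51.51),
}

if __name__ == "__main__":
    for ref, (prc, rec, f1, acc) in RUNS.items():
        flag = "collapsed" if predicts_all_positive(prc, rec, acc) else "ok"
        print(f"{ref}: F1 recomputed {f1_score(prc, rec):.2f}, listed {f1}, {flag}")
```

Read this way, the LR 0.001 rows are best interpreted as failed training runs rather than as weak but functioning models.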

Table 7. Gender-based fairness evaluation of the SWT-DD-3S model.

Metric	Male	Female	Difference/Ratio	Interpretation
TPR	0.959	0.947	EOD = 0.012	Minor disparity
PPR	0.515	0.510	DPD = 0.005	Acceptable bias
DIR	–	–	0.990	Within fair range
Brier Score	0.034	0.040	–	Well calibrated
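
The Difference/Ratio column in Table 7 can be reproduced from the per-group rates. The sketch below assumes the common definitions: EOD (equal opportunity difference) as the absolute gap in true positive rate, DPD (demographic parity difference) as the absolute gap in predicted positive rate (PPR), and DIR (disparate impact ratio) as the smaller PPR divided by the larger. These expansions and the orientation of the ratio are assumptions, as are all names in the code. The Brier score is the usual mean squared error between predicted probabilities and binary labels; the probabilities in the demo are made-up toy values, not data from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GroupRates:
    tpr: float  # true positive rate for the group
    ppr: float  # predicted positive rate for the group


def equal_opportunity_difference(a: GroupRates, b: GroupRates) -> float:
    """Absolute gap in true positive rate between two groups."""
    return abs(a.tpr - b.tpr)


def demographic_parity_difference(a: GroupRates, b: GroupRates) -> float:
    """Absolute gap in predicted positive rate between two groups."""
    return abs(a.ppr - b.ppr)


def disparate_impact_ratio(a: GroupRates, b: GroupRates) -> float:
    """Smaller predicted positive rate divided by the larger (assumed orientation)."""
    low, high = sorted((a.ppr, b.ppr))
    return low / high


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Per-group rates copied from Table 7.
MALE = GroupRates(tpr=0.959, ppr=0.515)
FEMALE = GroupRates(tpr=0.947, ppr=0.510)

if __name__ == "__main__":
    print(f"EOD = {equal_opportunity_difference(MALE, FEMALE):.3f}")
    print(f"DPD = {demographic_parity_difference(MALE, FEMALE):.3f}")
    print(f"DIR = {disparate_impact_ratio(MALE, FEMALE):.3f}")
    # Toy probabilities and labels, for illustration only.
    print(f"Brier (toy) = {brier_score([0.9, 0.2, 0.7], [1, 0, 1]):.4f}")
```

With the Table 7 rates, these definitions give values that round to the reported 0.012, 0.005, and 0.990.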

Table 8. Comparison with vision-based drowsiness detection methods from the literature evaluated on NTHU-DDD; bold text marks the overall best values.

Paper	Approach	Modality	Acc (%)	Testing Data
[42]	CNN with late fusion	3 streams (global, face, motion)	73.06	Eval
[43]	MSTN: CNN + LSTM + temporal smoothing	Face crops	82.61	Eval, Test
[45]	Multi-granularity CNN + fusion	Face patches	90.05	Eval
[44]	ConvCGRNN real-time	Frames	84.81	Eval
[47]	3D-CNN + condition fusion	Full frame + face	76.2	Eval
[46]	Two-Stream Multi-Feature Net	RGB + flow + landmarks	94.46	Eval
[22]	Multi-scale spatio-temporal attention GCN	Face landmarks	92.4	Test (landmarked)
[48]	ResNet-34 + LSTM	Face video	87.40	Eval
Ours	SWT-DD-2S	Eyes, mouth and gender	94.93	Eval
Ours	SWT-DD-3S	Eyes, mouth, face and gender	95.47	Eval

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nurnoby, M.F.; El-Alfy, E.-S.M. Gender-Aware Driver Drowsiness Detection Using Multi-Stream Shifted-Window-Based Hierarchical Vision Transformers. Appl. Sci. 2026, 16, 3353. https://doi.org/10.3390/app16073353

