Gender-Aware Driver Drowsiness Detection Using Multi-Stream Shifted-Window-Based Hierarchical Vision Transformers
Abstract
1. Introduction
- We propose a gender-aware multi-stream transformer with shifted windows that fuses eye and mouth encoders to learn discriminative features for drowsiness detection.
- We evaluate a diverse set of deep learning techniques for gender-aware and gender-agnostic driver drowsiness detection.
- We optimize the proposed two-stream approach for robust drowsiness detection through comprehensive ablation experiments that identify the best configuration settings for its pretrained and non-pretrained variants.
- We validate the proposed model and compare it with existing models through extensive experiments, providing empirical evidence of gender influence and bias in vision-based drowsiness detection.
2. Related Work
2.1. Convolutional-Based Feature Representations
2.2. Transformer-Based Methods
3. Proposed Methodology
3.1. Extraction of Drowsy Facial Regions
3.2. Extraction of Spatial Features
3.3. Gender-Aware Fusion of Facial Features for Classification
4. Results and Discussion
4.1. NTHU-DDD Dataset
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Evaluation of Single-Stream Models on Cropped Faces with and Without Margins
4.5. Evaluation of Region-Level Models
4.6. Evaluation of the Proposed Models
4.7. Ablation Study
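| Algorithm 1 Ablation study workflow. |
|---|
| ▹ Explore different models |
| for each of 16 iterations do |
| Sample a random configuration |
| GA ← False; train the model; evaluate it and record metrics |
| GA ← True; train the model; evaluate it and record metrics |
| end for |
| ▹ Baseline selection |
| Find the best-performing configuration across 30 runs |
| ▹ Finer optimization of baseline |
| for each modified parameter set p near the baseline do |
| Define a new configuration by adjusting one/two parameters in the baseline |
| Train and evaluate the new configuration |
| if it outperforms the current best then keep it as the new best end if |
| end for |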
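The workflow in Algorithm 1 can be sketched roughly as a random exploration followed by a local refinement around the best run. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `SEARCH_SPACE`, `train_and_evaluate`, `explore`, and `refine` are hypothetical names, the option values only echo columns of the ablation table, the scoring function is a deterministic mock standing in for real training, and the refinement step changes one parameter at a time rather than one or two. Run counts are illustrative.

```python
import random

# Hypothetical search space (assumption); values echo the ablation table columns.
SEARCH_SPACE = {
    "img_size": [96, 128, 160, 224, 256],
    "lr": [1e-3, 1e-4, 1e-5],
    "pretrained": [True, False],
    "head": ["linear", "mlp"],
}


def train_and_evaluate(config):
    """Mock stand-in for training and evaluation; returns a pseudo-accuracy.

    A real version would train the model and return validation metrics.
    """
    rng = random.Random(repr(sorted(config.items())))
    return rng.uniform(0.5, 1.0)


def explore(n_configs=16, seed=0):
    """Stage 1: try random configurations, each with GA off and GA on."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_configs):
        base = {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}
        for gender_aware in (False, True):
            config = {**base, "gender_aware": gender_aware}
            runs.append((train_and_evaluate(config), config))
    return runs


def refine(baseline, baseline_score):
    """Stage 3: perturb one parameter at a time and keep any improvement."""
    best, best_score = baseline, baseline_score
    for key, options in SEARCH_SPACE.items():
        for value in options:
            if value == baseline[key]:
                continue  # skip the unchanged setting
            candidate = {**baseline, key: value}
            score = train_and_evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


runs = explore()
# Stage 2: baseline selection, i.e. the best-scoring run from the exploration.
baseline_score, baseline = max(runs, key=lambda run: run[0])
best, best_score = refine(baseline, baseline_score)
print(best, round(best_score, 4))
```

Separating exploration from refinement keeps the expensive random stage small while still allowing targeted changes around a promising baseline.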
5. Gender-Based Fairness Evaluation
Comparisons with Existing Models and Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- National Highway Traffic Safety Administration. Drowsy driving. Available online: https://www.nhtsa.gov/risky-driving/drowsy-driving (accessed on 7 March 2026).
- Hashemi, M.; Mirrashid, A.; Beheshti Shirazi, A. Driver safety development: Real-time driver drowsiness detection system based on convolutional neural network. SN Comput. Sci. 2020, 1, 289.
- Dua, M.; Shakshi; Singla, R.; Raj, S.; Jangra, A. Deep CNN models-based ensemble approach to driver drowsiness detection. Neural Comput. Appl. 2021, 33, 3155–3168.
- Albadawi, Y.; AlRedhaei, A.; Takruri, M. Real-time machine learning-based driver drowsiness detection using visual features. J. Imaging 2023, 9, 91.
- Jahan, I.; Uddin, K.A.; Murad, S.A.; Miah, M.S.U.; Khan, T.Z.; Masud, M.; Aljahdali, S.; Bairagi, A.K. 4D: A real-time driver drowsiness detector using deep learning. Electronics 2023, 12, 235.
- Salem, D.; Waleed, M. Drowsiness detection in real-time via convolutional neural networks and transfer learning. J. Eng. Appl. Sci. 2024, 71, 122.
- Jarndal, A.; Tawfik, H.; Siam, A.I.; Alsyouf, I.; Cheaitou, A. A real-time vision transformers-based system for enhanced driver drowsiness detection and vehicle safety. IEEE Access 2024, 13, 1790–1803.
- Zhang, Z.; Ning, H.; Zhou, F. A systematic survey of driving fatigue monitoring. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19999–20020.
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503.
- Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-shot multi-level face localisation in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020; pp. 5203–5212.
- Guo, X.; Li, S.; Yu, J.; Zhang, J.; Ma, J.; Ma, L.; Liu, W.; Ling, H. PFLD: A practical facial landmark detector. arXiv 2019, arXiv:1902.10859.
- Xiao, W.; Liu, H.; Ma, Z.; Chen, W.; Sun, C.; Shi, B. Fatigue driving recognition method based on multi-scale facial landmark detector. Electronics 2022, 11, 4103.
- Makhmudov, F.; Turimov, D.; Xamidov, M.; Nazarov, F.; Cho, Y.I. Real-time fatigue detection algorithms using machine learning for yawning and eye state. Sensors 2024, 24, 7810.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Stancin, I.; Zeba, M.Z.; Friganovic, K.; Cifrek, M.; Jovic, A. Information on drivers’ sex improves EEG-based drowsiness detection model. Appl. Sci. 2022, 12, 8146.
- Clavell, G.G.; González-Sendino, R.; Vazquez, P. Demographic benchmarking: Bridging socio-technical gaps in bias detection. arXiv 2025, arXiv:2501.15985.
- Deng, W.; Wu, R. Real-time driver-drowsiness detection system using facial features. IEEE Access 2019, 7, 118727–118738.
- Guo, J.M.; Markoni, H. Driver drowsiness detection using hybrid convolutional neural network and long short-term memory. Multimed. Tools Appl. 2019, 78, 29059–29087.
- Maior, C.B.S.; das Chagas Moura, M.J.; Santana, J.M.M.; Lins, I.D. Real-time classification for autonomous drowsiness detection using eye aspect ratio. Expert Syst. Appl. 2020, 158, 113505.
- Fa, S.; Yang, X.; Han, S.; Feng, Z.; Chen, Y. Multi-scale spatial–temporal attention graph convolutional networks for driver fatigue detection. J. Vis. Commun. Image Represent. 2023, 93, 103826.
- Xiao, W.; Liu, H.; Ma, Z.; Chen, W.; Hou, J. FPIRST: Fatigue driving recognition method based on feature parameter images and a residual Swin transformer. Sensors 2024, 24, 636.
- Mate, P.; Patil, A.; Talhar, M.; Khade, A. Detection of driver drowsiness using transfer learning techniques. Multimed. Tools Appl. 2024, 83, 35237–35255.
- Essahraui, S.; Lamaakal, I.; El Hamly, I.; Maleh, Y.; Ouahbi, I.; El Makkaoui, K.; Filali Bouami, M.; Pławiak, P.; Alfarraj, O.; Abd El-Latif, A.A. Real-time driver drowsiness detection using facial analysis and machine learning techniques. Sensors 2025, 25, 812.
- Hassan, O.F.; Ibrahim, A.F.; Gomaa, A.; Makhlouf, M.; Hafiz, B. Real-time driver drowsiness detection using transformer architectures: A novel deep learning approach. Sci. Rep. 2025, 15, 17493.
- Abd El-Nabi, S.; Ibrahim, A.F.; El-Rabaie, E.S.M.; Hassan, O.F.; Soliman, N.F.; Ramadan, K.F.; El-Shafai, W. Driver drowsiness detection using swin transformer and diffusion models for robust image denoising. IEEE Access 2025, 13, 71880–71907.
- Rahmani, C.; Benlamoudi, A.; Bounab, Y.; Bekhouche, S.E.; Samai, D.; Dornaika, F.; Taleb, A.; Belhaouari, S.B. A Semi-supervised neural framework for real-time drowsiness detection using facial cues. IEEE Access 2026, 14, 12816–12836.
- Thampi, L.L.; Neethu, C.T.; Reddy, A.K.; Khan, I.A.; Aswathy, M.A.; Kumar, A.; Kumar, S. Smart driver assistance: Real-time drowsiness detection leveraging facial cues with MediaPipe and OpenCV. Int. J. Intell. Transp. Syst. Res. 2026.
- Bhanja, A.; Parhi, D.; Gajendra, D.; Sinha, K.; Sahoo, A.K. Driver drowsiness shield (DDSH): A real-time driver drowsiness detection system. Robomech J. 2025, 12, 1–11.
- Abo-Zahhad, M.M.; Elghamrawy, S.; Hefny, A.A.; Elawady, Y.H. Early drowsiness detection model in autonomous vehicles using GAN and YOLO integration. Neural Comput. Appl. 2025, 37, 28353–28378.
- Lin, L.; Wang, S.; Yang, J.; Wei, F. A multi-aware graph convolutional network for driver drowsiness detection. Knowl.-Based Syst. 2024, 305, 112643.
- Gao, Z.; Duan, P.; Li, R.; Tong, Z. A hybrid GCN-LSTM model for driver drowsiness detection. In SPIE—Fourth International Conference on Signal Processing and Computer Science (SPCS 2023); Nayyar, A., Kolivand, H., Eds.; SPIE: Washington, DC, USA, 2023.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; Curran Associates Inc.: Red Hook, NY, USA, 2017.
- Azmi, M.M.B.M.; Zaman, F.H.K. Driver drowsiness detection using vision transformer. In 2024 IEEE 14th Symposium on Computer Applications & Industrial Electronics (ISCAIE); IEEE: Piscataway, NJ, USA, 2024; pp. 329–336.
- Phan, T.-C.; Phan, A.-C.; Nguyen, N.-H. A novel approach of drowsiness levels detection using Vis-Net combined with facial emotion. Syst. Soft Comput. 2025, 7, 200288.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning; PMLR: London, UK, 2021; pp. 10347–10357.
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 558–567.
- Khan, S.S.; Sengupta, D.; Ghosh, A.; Chaudhuri, A. MTCNN++: A CNN-based face detection algorithm inspired by MTCNN. Vis. Comput. 2024, 40, 899–917.
- Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2018; Volume 32.
- Weng, C.H.; Lai, Y.H.; Lai, S.H. Driver drowsiness detection via a hierarchical temporal deep belief network. In Computer Vision–ACCV 2016 Workshops; Revised Selected Papers, Part III 13; Springer: Cham, Switzerland, 2017; pp. 117–133.
- Park, S.; Pan, H.; Kang, S.; Yoo, C. Driver drowsiness detection system based on feature representation learning using various deep networks. In ACCV 2016 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 154–164.
- Shih, J.; Hsu, Y. MSTN: Multistage spatial–temporal network for driver drowsiness detection. In ACCV 2016 Workshops; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2017; pp. 146–153.
- Dang, T.; Hoang, H.; Do, T.; Pham, V. A deep neural network for real-time driver drowsiness detection. IEICE Trans. Inf. Syst. 2019, 102, 1374–1383.
- Lyu, S.; Yuan, J.; Chen, Y. Long-term multi-granularity deep framework for driver drowsiness detection. arXiv 2018, arXiv:1801.02325.
- Shen, J.; Wang, X.; Song, Y. Robust two-stream multi-feature network for driver drowsiness detection. arXiv 2020, arXiv:2010.06235.
- Yu, J.; Park, S.; Lee, S.; Jeon, M. Driver drowsiness detection using condition-adaptive representation learning framework. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4206–4218.
- Tüfekci, G.; Kayabaşı, A.; Akagündüz, E.; Ulusoy, I. Detecting driver drowsiness as an anomaly using LSTM autoencoders. In ECCV 2022 Workshops; Springer: Cham, Switzerland, 2023.







| Ref | Year | Model(s) | Accuracy | Strengths and Weaknesses | Dataset(s) |
|---|---|---|---|---|---|
| [19] | 2019 | DriCare: Multiple CNN-kernelized correlation filters with MTCNN | 92% (Avg.) | High accuracy under various conditions but performance decreases in low light or when driver wears glasses. | CelebA, YawDD |
| [20] | 2019 | Hybrid CNN-LSTM | 84.85% | Effectively captures spatiotemporal features; however, it relies heavily on handcrafted data preprocessing and manually defined skip intervals. | ACCV Drowsy |
| [21] | 2020 | EAR+SVM | 94.44% | Runs in real time with a short temporal window; however, the evidence base is small and narrow, relying only on eye patterns with manual temporal processing via a vector. | DROZY |
| [2] | 2020 | FD-NN, TL-VGG16, TL-VGG19 | FD-NN: 98.15%, TL-VGG16: 95.45%, TL-VGG19: 95% | Effective in capturing fine-grained features for fatigue, but depends on pretrained models and the dataset is limited. | Self-prepared ZJU dataset |
| [3] | 2021 | Deep-CNN-based ensemble | 85% | Considers different types of features but achieves low accuracy. | NTHU-DDD |
| [5] | 2023 | VGG16, VGG19, and 4D | VGG16: 95.93%, VGG19: 95.03%, 4D: 97.53% | Consistent results and good temporal analysis but may struggle under dynamic conditions as the dataset is clean and oversimplified. | MRL |
| [22] | 2023 | MSSTAGCN built on facial landmark graphs | 92.4% | Resilient to lighting changes, occlusions, and skin-tone differences, but relies on accurate OpenPose landmarks and errors can propagate. | NTHU-DDD |
| [23] | 2024 | FPIRST: Residual Swin Transformer | 96.40% | Captures fine-grained temporal facial features (eyes, mouth) but performance drops in complex scenarios. | HNUFD |
| [6] | 2024 | CNN, InceptionV3, MobileNetV2 | CNN: 96%, MobileNetV2: 97%, InceptionV3: 98% | Responsive model for real-time but the dataset is oversimplified including only open/closed eyes. | MRL |
| [13] | 2024 | VGG16 and CNN | VGG16: 95.85%, CNN: 96.45% | Combines Haar cascades and CNN feature extraction, but may need more power in high-complexity tasks and was tested on a limited dataset. | YawDD, MRL |
| [24] | 2024 | VGG19 | 96.51% | Performs well in different lighting and environmental scenarios but lacks flexibility due to static network structure. | NTHU-DDD |
| [25] | 2025 | KNN, SVM, CNN, YOLOv5, YOLOv8, Faster R-CNN | KNN: 98.89% (UTA-RLDD); CNN: 99.97%; YOLOv5/YOLOv8: 99.5% | Runs in real time and achieves near-perfect accuracy using YOLO, but shows low performance on YawDD due to lighting/yawning variability and requires significant computational resources. | UTA-RLDD, NTHU-DDD, YawDD |
| [7] | 2025 | ViT-DDD | 98.89%, 99.4% | Implements a prototype with a real-time ViT pipeline that uses the full face, but uses a subset of the data without applying a standard train–test split. | NTHU-DDD, UTA-RLDD |
| [26] | 2025 | Swin-T | MRL: 99.03%, NTHU-DDD: 98.76%, CEW: 100% | Provides explainability via CAM focusing on eye regions, but considers only the eyes and is evaluated mostly offline. | MRL Eye, CEW, NTHU-DDD |
| [27] | 2025 | Swin-T | Eye Blink: 99.82%, CEW: 99.94% | Better denoising capability; however, diffusion is computationally costly. | Eye Blink, CEW |
| [28] | 2026 | YOLOv8 + Swin-T | UTA-RLDD: 99.99%; YawDD: 99.34%; NTHU-DDD: 95.94% | Semi-supervised learning reduces labeling dependence; however, pseudo-label noise and face-detection failures can still propagate errors. | NTHU-DDD, YawDD, UTA-RLDD |
| [29] | 2026 | CNN + Dlib/MediaPipe + OpenCV | MRL Eye: 84.53%; YawDD: 96.42% | Compares learned CNN cues with classical landmark-based tracking, but CNN accuracy on MRL is moderate. | MRL Eye, YawDD |

| Actual \ Predicted | Alert (Negative) | Drowsy (Positive) |
|---|---|---|
| Alert | TN | FP |
| Drowsy | FN | TP |
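
The confusion-matrix cells above map onto the precision (Prc), recall (Rec), F1, and accuracy (Acc) columns reported in the result tables. As a minimal sketch, assuming the standard definitions of these metrics with Drowsy as the positive class, they can be computed as follows. The function name `classification_metrics` and the counts in the usage line are hypothetical and not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    Drowsy is treated as the positive class, Alert as the negative class.
    Zero denominators yield 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which gives a quick sanity check on any reported row.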

| Model | Finetuned? | Gender-Agnostic Prc | Gender-Agnostic Rec | Gender-Agnostic F1 | Gender-Agnostic Acc | Gender-Aware Prc | Gender-Aware Rec | Gender-Aware F1 | Gender-Aware Acc |
|---|---|---|---|---|---|---|---|---|---|
| ResNet50-SS | N | 90.96 | 88.02 | 89.47 | 89.29 | 89.25 | 90.59 | 89.92 | 89.50 |
| ResNet50-SS | Y | 91.67 | 89.94 | 90.82 | 90.60 | 93.82 | 89.24 | 91.47 | 91.40 |
| ViT-SS | N | 90.40 | 84.41 | 87.30 | 87.30 | 88.08 | 91.86 | 89.93 | 89.36 |
| ViT-SS | Y | 91.71 | 90.41 | 91.05 | 90.81 | 92.92 | 89.93 | 91.40 | 91.25 |
| DeiT-SS | N | 83.71 | 81.16 | 82.42 | 81.95 | 84.74 | 85.79 | 85.26 | 84.67 |
| DeiT-SS | Y | 92.77 | 89.63 | 91.12 | 90.98 | 94.08 | 88.92 | 91.43 | 91.38 |
| SWT-DD-SS | N | 91.60 | 87.92 | 89.82 | 89.70 | 91.37 | 90.35 | 90.86 | 90.61 |
| SWT-DD-SS | Y | 93.38 | 89.93 | 91.62 | 91.50 | 94.87 | 89.08 | 91.89 | 91.87 |

| Model | Finetuned? | Gender-Agnostic Prc | Gender-Agnostic Rec | Gender-Agnostic F1 | Gender-Agnostic Acc | Gender-Aware Prc | Gender-Aware Rec | Gender-Aware F1 | Gender-Aware Acc |
|---|---|---|---|---|---|---|---|---|---|
| ResNet50-SS | N | 92.60 | 87.71 | 90.09 | 90.03 | 90.39 | 90.20 | 90.65 | 90.39 |
| ResNet50-SS | Y | 95.91 | 86.47 | 90.94 | 91.10 | 93.45 | 90.56 | 91.98 | 91.84 |
| ViT-SS | N | 89.97 | 88.56 | 89.26 | 88.99 | 88.08 | 91.86 | 89.93 | 89.36 |
| ViT-SS | Y | 91.58 | 89.45 | 90.50 | 90.11 | 92.74 | 90.90 | 91.81 | 91.62 |
| DeiT-SS | N | 82.94 | 90.25 | 86.44 | 85.08 | 85.68 | 82.73 | 84.18 | 83.93 |
| DeiT-SS | Y | 92.06 | 91.48 | 91.77 | 91.52 | 93.33 | 89.09 | 91.16 | 91.07 |
| SWT-DD-SS | N | 92.79 | 85.52 | 89.01 | 89.08 | 94.02 | 91.19 | 92.58 | 92.45 |
| SWT-DD-SS | Y | 95.09 | 88.88 | 91.88 | 91.88 | 93.44 | 94.74 | 94.08 | 93.84 |

| Model | Finetuned? | Gender-Agnostic Prc | Gender-Agnostic Rec | Gender-Agnostic F1 | Gender-Agnostic Acc | Gender-Aware Prc | Gender-Aware Rec | Gender-Aware F1 | Gender-Aware Acc | PAI |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50-2S | N | 82.61 | 90.56 | 86.40 | 85.28 | 89.66 | 83.87 | 86.67 | 86.67 | +1.39 |
| ResNet50-2S | Y | 90.38 | 84.42 | 87.30 | 87.31 | 91.29 | 87.21 | 89.20 | 89.10 | +1.79 |
| ViT-2S | N | 84.95 | 85.99 | 85.47 | 84.93 | 85.83 | 87.54 | 86.68 | 86.13 | +1.2 |
| ViT-2S | Y | 93.01 | 94.34 | 93.67 | 93.43 | 93.13 | 95.07 | 94.09 | 93.85 | +0.42 |
| DeiT-2S | N | 87.68 | 87.30 | 87.49 | 87.14 | 89.33 | 87.84 | 88.58 | 88.33 | +0.79 |
| DeiT-2S | Y | 92.22 | 92.85 | 92.53 | 92.28 | 94.47 | 92.07 | 93.26 | 93.14 | −0.45 |
| SWT-DD-2S | N | 89.41 | 88.77 | 89.09 | 88.80 | 88.97 | 91.08 | 90.01 | 89.59 | +0.79 |
| SWT-DD-2S | Y | 93.13 | 95.07 | 94.09 | 93.75 | 95.42 | 94.70 | 95.06 | 94.93 | +1.18 |
| SWT-DD-3S | Y | 95.24 | 95.32 | 95.28 | 95.12 | 95.98 | 95.21 | 95.59 | 95.47 | +0.35 |
| Ref | IMG | LR | GA | Pret | Freeze.B | Modal | Fusion | Head | H.Hid | H.Drop | Aug | #P | Size | L | Prc | Rec | F1 | Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| s41 | 96 | 0.0001 | True | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.47 | 96.2 | 92.56 | 94.34 | 94.28 |
| s42 | 96 | 0.001 | True | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.47 | 51.51 | 100.0 | 68.0 | 51.51 |
| s44 | 96 | 0.00001 | True | False | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.24 | 90.86 | 90.21 | 90.54 | 90.28 |
| s45 | 96 | 0.00001 | False | True | False | E+M | A | mlp | 384 | 0.25 | light | 57.7 | 220.68 | 0.48 | 94.1 | 94.76 | 94.43 | 94.24 |
| s46 | 96 | 0.00001 | False | True | False | E+M | S | mlp | 256 | 0.25 | light | 55.24 | 211.29 | 0.47 | 74.59 | 58.19 | 65.38 | 68.25 |
| s49 | 96 | 0.001 | False | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.54 | 0.47 | 51.51 | 100.0 | 68.0 | 51.51 |
| s50 | 96 | 0.00001 | False | False | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.54 | 0.47 | 93.49 | 91.45 | 92.46 | 92.31 |
| s51 | 96 | 0.00001 | False | False | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.54 | 0.24 | 92.05 | 89.32 | 90.67 | 90.53 |
| s52 | 128 | 0.00001 | True | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.96 | 94.87 | 94.48 | 94.67 | 94.52 |
| s55 | 128 | 0.00001 | True | False | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.96 | 91.34 | 93.62 | 92.47 | 92.14 |
| s56 | 128 | 0.00001 | True | False | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.48 | 93.41 | 90.41 | 91.89 | 91.77 |
| s57 | 160 | 0.00001 | True | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 1.17 | 95.82 | 93.66 | 94.73 | 94.63 |
| s58 | 160 | 0.0001 | True | True | False | E+M | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 1.17 | 93.7 | 94.6 | 94.15 | 93.94 |
| s61 | 160 | 0.00001 | True | False | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.58 | 92.48 | 92.3 | 92.39 | 92.17 |
| s62 | 160 | 0.00001 | True | False | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.58 | 92.53 | 88.84 | 90.65 | 90.56 |
| s1 | 224 | 0.0001 | False | True | False | E+M | C | linear | 0 | 0.0 | light | 55.04 | 210.55 | 1.75 | 95.16 | 93.03 | 94.08 | 93.97 |
| s3 | 224 | 0.0001 | False | True | False | E+M | A | mlp | 384 | 0.2 | heavy | 57.7 | 220.68 | 1.75 | 91.88 | 51.97 | 66.39 | 72.89 |
| s4 | 224 | 0.0001 | False | False | False | E+M | C | linear | 0 | 0.0 | light | 55.04 | 210.55 | 1.75 | 91.42 | 91.11 | 91.27 | 91.02 |
| s6 | 224 | 0.0001 | False | True | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.54 | 0.87 | 95.02 | 91.13 | 93.04 | 92.97 |
| s8 | 224 | 0.0001 | False | True | False | E+M | S | mlp | 512 | 0.2 | light | 55.43 | 212.04 | 1.75 | 94.96 | 93.97 | 94.46 | 94.32 |
| s15 | 224 | 0.0001 | False | True | True | E+M | C | mlp | 256 | 0.1 | light | 55.44 | 212.05 | 1.75 | 92.93 | 93.34 | 93.14 | 92.91 |
| s16 | 224 | 0.00001 | False | True | False | E+M | S | mlp | 256 | 0.05 | light | 55.24 | 211.29 | 1.75 | 95.33 | 93.81 | 94.57 | 94.45 |
| s12 | 224 | 0.00001 | False | True | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.54 | 0.87 | 94.83 | 93.24 | 94.03 | 93.9 |
| s18 | 224 | 0.0001 | True | True | False | E+M | S | mlp | 512 | 0.1 | light | 55.83 | 213.56 | 1.75 | 94.88 | 94.51 | 94.7 | 94.55 |
| s20 | 224 | 0.0001 | True | True | False | E+M | C | mlp | 256 | 0.05 | light | 55.83 | 213.56 | 1.75 | 94.87 | 93.76 | 94.31 | 94.17 |
| s24 | 224 | 0.0001 | True | True | False | E+M | S | mlp | 512 | 0.2 | light | 55.83 | 213.56 | 1.75 | 93.38 | 95.32 | 94.34 | 94.11 |
| s26 | 224 | 0.0001 | True | True | False | E+M | C | mlp | 256 | 0.05 | light | 55.83 | 213.56 | 1.75 | 95.56 | 92.43 | 93.97 | 93.89 |
| s27 | 224 | 0.0001 | True | True | False | E+M | C | mlp | 384 | 0.1 | light | 56.23 | 215.07 | 1.75 | 94.67 | 94.53 | 94.6 | 94.44 |
| s32 | 224 | 0.00001 | True | True | False | E+M | S | mlp | 256 | 0.05 | light | 55.44 | 212.05 | 1.75 | 95.42 | 94.7 | 95.06 | 94.93 |
| s36 | 224 | 0.00001 | True | False | False | E+M | S | linear | 0 | 0.0 | heavy | 55.04 | 210.56 | 1.75 | 77.45 | 53.26 | 63.12 | 67.94 |
| s38 | 224 | 0.00001 | True | True | False | E+M | A | mlp | 384 | 0.25 | light | 58.0 | 221.82 | 1.75 | 93.01 | 95.61 | 94.29 | 94.03 |
| s39 | 224 | 0.00001 | True | True | False | E+M | S | mlp | 256 | 0.25 | light | 55.44 | 212.05 | 1.75 | 75.9 | 64.87 | 69.95 | 71.29 |
| s23 | 224 | 0.00001 | True | False | False | E+M | C | linear | 0 | 0.0 | light | 55.05 | 210.58 | 1.75 | 94.06 | 93.28 | 93.67 | 93.5 |
| s28 | 224 | 0.00001 | True | True | False | E | S | linear | 0 | 0.0 | light | 55.04 | 210.56 | 0.87 | 94.16 | 94.42 | 94.29 | 94.11 |
| s13 | 256 | 0.0001 | False | True | False | E+M | C | linear | 0 | 0.0 | light | 55.04 | 210.55 | 2.93 | 93.65 | 93.15 | 93.4 | 93.21 |
| s14 | 256 | 0.0001 | False | True | False | E+M | S | mlp | 384 | 0.1 | light | 55.34 | 211.67 | 2.93 | 94.48 | 93.5 | 93.99 | 93.84 |
| s22 | 256 | 0.0001 | True | True | False | E+M | C | linear | 0 | 0.0 | light | 55.05 | 210.58 | 2.94 | 93.99 | 93.81 | 93.9 | 93.72 |
| s29 | 256 | 0.0001 | True | True | False | E+M | C | linear | 0 | 0.0 | light | 55.05 | 210.58 | 2.94 | 92.84 | 95.25 | 94.03 | 93.76 |
| s66 | 224 | 0.00001 | True | True | False | E+M+F | S | mlp | 224 | 0.05 | light | 57.44 | 217.08 | 3.57 | 95.98 | 95.21 | 95.59 | 95.47 |
| Metric | Male | Female | Difference/Ratio | Interpretation |
|---|---|---|---|---|
| TPR | 0.959 | 0.947 | EOD = 0.012 | Minor disparity |
| PPR | 0.515 | 0.510 | DPD = 0.005 | Acceptable bias |
| DIR | – | – | 0.990 | Within fair range |
| Brier Score | 0.034 | 0.040 | – | Well calibrated |
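
The difference and ratio entries in the fairness table can be reproduced from the per-group rates. The sketch below assumes the usual readings of the acronyms, which the table itself does not spell out: EOD as the absolute gap in true positive rate (TPR), DPD as the absolute gap in positive prediction rate (PPR), and DIR as the ratio of the smaller PPR to the larger. The function names are hypothetical, and the Brier score inputs are made-up illustrative values, not data from the paper.

```python
def fairness_gaps(tpr_a, tpr_b, ppr_a, ppr_b):
    """Return (EOD, DPD, DIR) for two groups under the assumed definitions."""
    eod = abs(tpr_a - tpr_b)  # equal opportunity difference
    dpd = abs(ppr_a - ppr_b)  # demographic parity difference
    dir_ = min(ppr_a, ppr_b) / max(ppr_a, ppr_b)  # disparate impact ratio
    return eod, dpd, dir_


def brier_score(probabilities, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have equal length")
    squared = [(p - y) ** 2 for p, y in zip(probabilities, labels)]
    return sum(squared) / len(squared)


# Male and female rates as listed in the table.
eod, dpd, dir_ = fairness_gaps(tpr_a=0.959, tpr_b=0.947, ppr_a=0.515, ppr_b=0.510)
print(f"EOD = {eod:.3f}, DPD = {dpd:.3f}, DIR = {dir_:.3f}")
# EOD = 0.012, DPD = 0.005, DIR = 0.990

# Illustrative probabilities and labels only.
example_brier = brier_score([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
print(f"Brier = {example_brier:.3f}")
```

A lower Brier score indicates better-calibrated probabilities, so the score is computed per group and the two values are compared directly rather than combined into a single ratio.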
| Paper | Approach | Modality | Acc | Testing Data |
|---|---|---|---|---|
| [42] | CNN with late fusion | 3 streams (global, face, motion) | 73.06 | Eval |
| [43] | MSTN: CNN + LSTM + temporal smoothing | Face crops | 82.61 | Eval, Test |
| [45] | Multi-granularity CNN + fusion | Face patches | 90.05 | Eval |
| [44] | ConvCGRNN real-time | Frames | 84.81 | Eval |
| [47] | 3D-CNN + condition fusion | Full frame + face | 76.2 | Eval |
| [46] | Two-Stream Multi-Feature Net | RGB + flow + landmarks | 94.46 | Eval |
| [22] | Multi-scale spatio-temporal attention GCN | Face landmarks | 92.4 | Test (Land-marked) |
| [48] | ResNet-34 + LSTM | Face video | 87.40 | Eval |
| Ours | SWT-DD-2S | Eyes, mouth and gender | 94.93 | Eval |
| Ours | SWT-DD-3S | Eyes, mouth, face and gender | 95.47 | Eval |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Nurnoby, M.F.; El-Alfy, E.-S.M. Gender-Aware Driver Drowsiness Detection Using Multi-Stream Shifted-Window-Based Hierarchical Vision Transformers. Appl. Sci. 2026, 16, 3353. https://doi.org/10.3390/app16073353

