Sensors
  • Article
  • Open Access

30 April 2025

RMTSE: A Spatial-Channel Dual Attention Network for Driver Distraction Recognition

1 Faculty of Information Science, Huxi Campus, Chongqing University, Chongqing 400044, China
2 State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing 400023, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Sensors and Sensing Technologies for Traffic, Driving and Transportation

Abstract

Driver distraction has become a critical factor in traffic accidents, necessitating accurate behavior recognition for road safety. However, existing methods still suffer from limitations such as low accuracy in recognizing drivers’ localized actions and difficulties in distinguishing subtle differences between behaviors. This paper proposes RMTSE, a hybrid attention model, to enhance driver distraction recognition. The framework introduces a Manhattan Self-Attention Squeeze-and-Excitation (MaSA-SE) module that combines spatial self-attention with a channel attention mechanism. This integration simultaneously enhances discriminative features and suppresses irrelevant characteristics in driving behavior images, improving learning efficiency through focused feature extraction. We also employ a transfer learning strategy that uses pre-trained weights during training, which further accelerates model convergence and enhances feature generalization. The model achieves Top-1 accuracies of 99.82% and 94.95% on the SFD3 and 100-Driver datasets, respectively, with minimal parameter increments, outperforming existing state-of-the-art methods.

1. Introduction

According to the definition by the US-EU Bilateral ITS Technical Task Force [1], “driver distraction” refers to “the diversion of attention from activities critical for safe driving to a competing activity”. It is widely acknowledged that driver distraction significantly increases the risk of traffic accidents [1]. The advancement of intelligent technologies has introduced additional factors contributing to driver distraction, such as mobile phone usage for calls or text messaging while driving. Data from the World Health Organization’s 2023 Global Status Report on Road Safety [2] reveal an estimated 1.19 million road traffic fatalities in 2021. Therefore, developing real-time recognition of distracted driving behaviors coupled with effective assessment and warning mechanisms holds substantial potential for reducing accident rates caused by driver distraction and enhancing safety for both drivers and passengers.
The comprehensive survey by Tan et al. [3] demonstrates that current primary methods for driver distraction recognition can be categorized into (1) vision-based methods and (2) non-visual data integration and analysis methods. Among these, research utilizing RGB modal data collected by in-vehicle cameras constitutes a crucial component of vision-based approaches, benefiting from advantages including ease of data collection, low implementation costs, and rich semantic/contour information extraction capabilities. Recent advancements in deep learning have established improved Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and Faster Region-based Convolutional Network (Faster R-CNN) models as robust tools for visual recognition tasks, leading to numerous studies on driver distraction recognition, exemplified by FRNet [4], IDC-Bi-LSTM-Att [5], CAT-CapsNet [6], and DD-RCNN [7]. However, existing deep learning-based approaches still face limitations such as inadequate global feature capture and non-parallelizable training processes. The Vision Transformer (ViT), leveraging its global modeling mechanism, fully utilizes GPU parallel computing capabilities while effectively capturing global features and inter-feature relationships. This architecture has demonstrated impressive performance on driver distraction recognition datasets, as evidenced by studies including [8,9]. However, traditional ViTs lack the inductive biases inherent in CNNs and therefore require large amounts of data for learning, so they generally underperform CNNs on smaller datasets. To address this, additional mechanisms are needed to explicitly strengthen ViT's focus on the regional features critical to the task, thereby improving learning efficiency when the model is trained with equivalent data. Meanwhile, compared to lightweight models, ViT requires longer training time to achieve acceptable recognition accuracy, which not only increases the demand for computational resources but also makes deployment more challenging in practical applications. Specifically, deploying ViT-based systems in real-world vehicular environments faces critical challenges including, but not limited to, the following: (1) hardware compatibility constraints for embedded deployment in vehicles with limited computational power, (2) real-time processing requirements for timely distraction alerts, and (3) significant dependency on large-scale, diverse training data to maintain robustness across varying demographic characteristics and driving environments. Therefore, reducing training time while preserving generalization performance, and improving the efficiency with which driver distraction data are utilized under equivalent parameter counts, remain crucial for making ViT-based methods practical in real-world tasks.
The contributions of this paper are summarized as follows:
(1) To achieve superior recognition accuracy on popular large-scale driver distraction datasets, we propose the RMTSE model, which integrates a decomposable self-attention mechanism with explicit spatial priors and a channel attention mechanism (CAM). This design endows the model with robust global spatial information modeling capabilities while enhancing its focus on inter-channel information flow through channel weight recalibration.
(2) We introduce a transfer learning strategy for driver distraction recognition tasks. This approach enables the model to acquire extensive low-level and general feature representations from larger datasets, which are unattainable when relying solely on driver distraction-specific datasets. Subsequent fine-tuning further empowers the model to achieve exceptional characterization of overall driver behavioral features. By leveraging transferable pre-trained weights, the proposed model attains competitive accuracy within a short training period and supports an extension to other driver distraction recognition datasets.
(3) We conduct comprehensive performance comparisons with state-of-the-art models implemented using mainstream methods on identical datasets. The RMTSE model demonstrates outstanding results on both the SFD3 and 100-Driver datasets, achieving higher average accuracy than numerous advanced approaches. These empirical outcomes validate the superiority and feasibility of the proposed method.

3. Methodology

The proposed RMTSE architecture aims to achieve precise driver distraction recognition. The effectiveness of integrating channel attention mechanisms into ViT stems from their ability to model nonlinear relationships between feature channels, complementing the self-attention mechanism. While ViT achieves global spatial dependency modeling through self-attention mechanisms, its core computational process fundamentally performs indiscriminate aggregation of features across all spatial positions, which lacks explicit evaluation of feature significance along the channel dimension. The output features from each ViT block often exhibit information redundancy across channels, where certain channels may contain task-irrelevant noise or secondary characteristics. SE-Net addresses this through adaptive channel weight learning, performing channel-wise recalibration of intermediate features to optimize information structure for subsequent layers. Particularly in deep networks, as inter-channel disparities in high-level semantic features intensify, SE-Net’s channel attention mechanism enhances category-relevant feature responses, enabling more efficient computational resource allocation during feature processing. A detailed schematic of the RMTSE model is shown in the accompanying Figure 1. The convolutional backbone initially extracts low-level feature information such as edges and basic shapes from input images. The MaSA-SE module prioritizes interactions between global spatial features and channel-wise information. Within this module, decomposable self-attention with Manhattan distance-based decay cooperates with a learnable channel attention mechanism, first modeling spatial dependencies and subsequently reassigning feature map importance across channels. When specific channels contribute minimally to feature representation, their information may be suppressed by the channel attention mechanism.
Figure 1. The feature construction process of RMTSE. The raw image is first fed into a convolutional backbone to generate preliminary feature maps, then sequentially processed through four consecutive stages. MaSA, within different stages, adopts decomposed or non-decomposed (original) forms depending on specific requirements.

3.1. Overall Architecture

Figure 1 illustrates the overall architecture of RMTSE. RGB images are first processed through a convolutional backbone containing four 3 × 3 convolutional layers to capture preliminary features and perform downsampling. The resultant feature maps are then fed into four consecutive stages for hierarchical feature construction from low-level to high-level representations. Each stage contains N MaSA-SE modules followed by a 3 × 3 convolutional layer with stride 2 for further downsampling. Within each MaSA-SE module, decomposable self-attention with customized decay computation is implemented. A channel attention module is positioned at the module’s end to holistically recalibrate the output. To reduce computational complexity, decomposed MaSA is applied in the first two stages, while the original MaSA operates in the latter two stages to enhance complex feature extraction. Finally, the output features are passed to a classifier for prediction.
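For concreteness, the layout above can be summarized in a minimal PyTorch sketch. This is not the authors' implementation: the stage depths, channel widths, and names (ConvStem, RMTSESketch, block_fn) are illustrative placeholders, and the MaSA-SE block of Section 3.2 would be plugged in as block_fn in place of the Identity stand-in.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Four 3x3 convolutions extracting low-level features and downsampling the input."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch // 2, 3, stride=1, padding=1), nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.stem(x)


class RMTSESketch(nn.Module):
    """Four stages of MaSA-SE blocks, each stage followed by a stride-2 conv for downsampling."""
    def __init__(self, block_fn=lambda dim, decomposed: nn.Identity(),  # placeholder block
                 depths=(2, 2, 6, 2), dims=(64, 128, 256, 512), num_classes=10):
        super().__init__()
        self.stem = ConvStem(3, dims[0])
        layers = []
        for i, (depth, dim) in enumerate(zip(depths, dims)):
            # decomposed MaSA in the first two stages, original MaSA in the last two
            layers += [block_fn(dim, decomposed=(i < 2)) for _ in range(depth)]
            if i < len(dims) - 1:
                layers.append(nn.Conv2d(dim, dims[i + 1], 3, stride=2, padding=1))
        self.stages = nn.Sequential(*layers)
        self.head = nn.Linear(dims[-1], num_classes)   # classifier

    def forward(self, x):
        x = self.stages(self.stem(x))
        x = x.mean(dim=(2, 3))                         # global average pooling
        return self.head(x)


# Usage: replace the Identity placeholder with the MaSA-SE block defined in Section 3.2.
model = RMTSESketch()
print(model(torch.randn(1, 3, 224, 224)).shape)        # torch.Size([1, 10])
```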

3.2. MaSA-SE Module

Traditional Transformer-style attention mechanisms lack CNN-like inductive biases for local features, rendering the model incapable of distinguishing differences between image patches. To address this, positional encoding (PE) must be explicitly introduced. Notably, in driver behavior images, the contributions of different spatial regions to distraction recognition vary significantly. Due to camera perspective constraints, critical regions for distinguishing driver features are often concentrated in limited areas. Even within these regions, distinct features exhibit varying degrees of influence on specific behavior recognition. To ensure the model captures essential features, we aim to focus attention on regions more conducive to final classification. For this purpose, we propose integrating a channel attention mechanism to further amplify or suppress information at specific locations based on PE, thereby enabling the model to enhance the learning capacity of key features.

3.2.1. Manhattan Distance-Decayed Self-Attention

Manhattan Self-Attention (MaSA), initially proposed in RMT [24], introduces an attention mask that decays according to the Manhattan distance between pixel pairs. Compared to conventional self-attention mechanisms that indiscriminately process embedded 2D patches, MaSA incorporates richer spatial priors while maintaining linear computational complexity for global information modeling. The original MaSA can be formulated as Equation (1):
$\mathrm{MaSA}(X) = \left( \mathrm{Softmax}\!\left(QK^{\top}\right) \odot D^{2d} \right) V$
where D 2 d denotes the decay matrix computed based on Manhattan distances between token pairs in the 2D plane, defined as:
$D^{2d}_{nm} = \gamma^{\,|x_n - x_m| + |y_n - y_m|}$
The decomposed MaSA is expressed as Equation (3). By recalculating the decay matrices along the horizontal ($D^{H}_{nm} = \gamma^{|y_n - y_m|}$) and vertical ($D^{W}_{nm} = \gamma^{|x_n - x_m|}$) directions, attention is computed separately along each axis and then combined:
$\mathrm{Attn}_H = \mathrm{Softmax}\!\left(Q_H K_H^{\top}\right) \odot D^{H}, \quad \mathrm{Attn}_W = \mathrm{Softmax}\!\left(Q_W K_W^{\top}\right) \odot D^{W}, \quad \mathrm{MaSA}(X) = \mathrm{Attn}_H \left( \mathrm{Attn}_W V \right)^{\top}$
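The following is a minimal single-head PyTorch sketch of Equations (1) and (2): it builds the Manhattan-distance decay mask and applies it to vanilla attention. Head and projection details, as well as the decomposed form used in the first two stages, are omitted, and the decay factor gamma is a placeholder value.

```python
import torch
import torch.nn.functional as F

def manhattan_decay_mask(h, w, gamma=0.9):
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1).float()  # (N, 2) token coordinates
    # |x_n - x_m| + |y_n - y_m| for every token pair
    dist = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1)      # (N, N)
    return gamma ** dist                                                # D^{2d}

def masa(q, k, v, decay):
    # q, k, v: (N, d) token embeddings for a single head; decay: (N, N) mask
    attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)  # standard scaled attention
    return (attn * decay) @ v   # element-wise Manhattan decay, then value aggregation

# Usage: a 14x14 token grid with 64-dimensional embeddings
h = w = 14
x = torch.randn(h * w, 64)
out = masa(x, x, x, manhattan_decay_mask(h, w))
print(out.shape)   # torch.Size([196, 64])
```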
Each self-attention module in MaSA-SE employs Locally-Enhanced Positional Encoding (LePE) [23]. Unlike the absolute positional encoding in the original ViT [21], LePE utilizes convolutional operations to capture relative positional information, which is directly applied to the output of the self-attention module to enhance spatial awareness. The feature maps processed by MaSA are subsequently fed into Feed-Forward Networks (FFN) for further refinement.
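A hedged sketch of LePE in this style, following the CSWin formulation: a depthwise 3 × 3 convolution over the value tokens whose output is added to the self-attention output. The exact kernel size and placement inside RMT may differ.

```python
import torch
import torch.nn as nn

class LePE(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # depthwise conv: groups == channels, so each channel is filtered independently
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, v, h, w):
        # v: (B, N, C) value tokens laid out on an h x w grid
        b, n, c = v.shape
        v2d = v.transpose(1, 2).reshape(b, c, h, w)
        return self.dw(v2d).reshape(b, c, n).transpose(1, 2)   # back to (B, N, C)

# Usage (attention_out and v both (B, N, C)): out = attention_out + LePE(dim)(v, h, w)
```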

3.2.2. Learnable Channel Weight Recalibration Module

At the end of each MaSA-SE module, we embed an SE-Net [27] module to capture channel attention from the input feature maps. This module compresses spatial dimension information via global average pooling to generate channel descriptors, followed by a two-layer fully connected (FC) network with nonlinear activation to model inter-channel dependencies. Finally, adaptive weights dynamically recalibrate feature channels. Specifically, the module comprises “Squeeze” and “Excitation” phases. For an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the Squeeze operation is formulated as Equation (4):
$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$
The Excitation operation follows:
$s = \sigma\!\left( W_2\, \delta\!\left( W_1 z \right) \right)$
$\tilde{x}_c = s_c \cdot u_c$
where $W_1 \in \mathbb{R}^{C/r \times C}$, $W_2 \in \mathbb{R}^{C \times C/r}$, $r = 16$ is the reduction ratio, $\delta(\cdot)$ denotes the ReLU activation, and $\sigma(\cdot)$ denotes the sigmoid function. Although this module introduces minimal parameters, it delivers substantial performance improvements, as demonstrated in the experimental section.
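A minimal PyTorch sketch of the SE module matching Equations (4)-(6); it mirrors the standard SE-Net design rather than the exact implementation used here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),   # W1 (bottleneck)
            nn.ReLU(inplace=True),                # delta
            nn.Linear(channels // r, channels),   # W2 (restore)
            nn.Sigmoid(),                         # sigma
        )

    def forward(self, x):                # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))           # Squeeze: z_c in Eq. (4)
        s = self.fc(z)                   # Excitation weights s in Eq. (5)
        return x * s[:, :, None, None]   # channel-wise recalibration, Eq. (6)
```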

3.2.3. Depthwise Separable Convolution and Regularization

As illustrated in the figure, each MaSA-SE module begins with a Depthwise Separable Convolution (DwConv) layer to preliminarily process input feature maps for patch embedding. The inductive bias inherent in convolutional layers enables the model to better retain local spatial information. Unlike standard convolutions, DwConv decomposes the convolution process into Depthwise Convolution and Pointwise Convolution, which reduces computational complexity and parameter count while enhancing model efficiency and inference speed. Furthermore, Layer Normalization (applied within the FFN module) and DropPath are integrated into the MaSA-SE module. At the end of each stage, Batch Normalization (implemented in the downsampling module) is adopted for model regularization. These mechanisms collectively improve generalization capability and ensure training stability.
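The sketch below shows the depthwise separable decomposition described above; kernel size and channel counts are illustrative.

```python
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise 3x3 conv (one filter per channel) followed by a pointwise 1x1 conv
    that mixes channels. A standard 3x3 conv needs 9*Cin*Cout weights, whereas this
    decomposition needs only 9*Cin + Cin*Cout."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```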

3.3. Transfer Learning Strategy

Transfer learning assumes that different tasks share inherent correlations, enabling knowledge acquired from one task to be transferred to another related task. Through pre-training, models can learn rich feature representations from large-scale datasets. In driver distraction behavior recognition tasks, fine-tuning allows the model to adjust its weights according to the target task’s data distribution, effectively leveraging knowledge from pre-training to accelerate convergence and enhance performance. ImageNet-1k [28], the most widely used subset of the ImageNet dataset, contains 1000 categories covering common objects in daily life (e.g., animals, vehicles, tools, natural scenes). These objects exhibit abundant contour and structural information, enabling models pre-trained on this dataset to develop robust capabilities in constructing and expressing low-level visual features and general patterns—attributes critical for characterizing holistic driver distraction behaviors.
Consequently, we adopt ImageNet-1k as the pre-training dataset. Specifically, we initialize the RMTSE model (excluding the SE-Net modules and classification head) with weights pre-trained on ImageNet-1k using the RMT-T [24] architecture. Since the final output for driver distraction recognition datasets involves 10 classes, we redesign and initialize the classification head. The SE-Net modules, characterized by minimal parameters and a focus on inter-channel relationships within input feature maps, are trained alongside the model during fine-tuning. For the SFD3 dataset, to enhance efficiency and consider that the initial three stages of the backbone primarily extract low-level features, we freeze all parameters in the convolutional backbone and the first three stages (except SE-Net modules). Only the SE-Net modules, the final stage (employing original MaSA), and the classification head are trained from scratch. For the 100-Driver dataset, all parameters are unfrozen to facilitate fair comparisons with other methods. The cross-entropy loss function is adopted for optimization.
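A hedged sketch of the SFD3 fine-tuning setup described above. The attribute names (stem, stage1 to stage4, .se., head) and the checkpoint handling are assumptions for illustration; the actual RMT-T checkpoint layout may differ.

```python
import torch
import torch.nn as nn

def prepare_for_sfd3(model, checkpoint_path, num_classes=10):
    # 1. Load ImageNet-1k pre-trained RMT-T weights; SE modules and the head are not
    #    in the checkpoint, so non-matching keys are skipped via strict=False.
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state, strict=False)

    # 2. Replace and re-initialize the classification head for the 10 behavior classes.
    model.head = nn.Linear(model.head.in_features, num_classes)

    # 3. Freeze the conv stem and the first three stages, except the SE modules,
    #    which are trained from scratch alongside the final stage and the head.
    for name, param in model.named_parameters():
        frozen_scope = name.startswith(("stem", "stage1", "stage2", "stage3"))
        param.requires_grad = (not frozen_scope) or (".se." in name)
    return model
```

For the 100-Driver experiments, step 3 would simply be omitted, since all parameters are unfrozen there.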

4. Experiments

4.1. Dataset Description

The State Farm Distracted Driver Detection (SFD3) dataset [29], provided for a Kaggle competition, aims to identify driver distraction behaviors through computer vision techniques. It contains 22,424 annotated images categorized into 10 classes: c0: Safe driving, c1: Texting with right hand, c2: Calling with the right hand, c3: Texting with left hand, c4: Calling with the left hand, c5: Operating the radio, c6: Drinking, c7: Reaching behind, c8: Adjusting hair or makeup, c9: Talking to passengers. All images were captured from a single perspective by a camera mounted in the vehicle, at a resolution of 640 × 480 pixels in RGB format.
The 100-Driver dataset [30] is designed for the Distracted Driver Classification (DDC) task. It comprises over 470,000 driving behavior images from 100 drivers captured by four cameras, with a resolution of 1920 × 1080 pixels and 22 distinct driver behavior classes. The dataset features cross-modal, cross-vehicle, and cross-view characteristics. Compared to SFD3, the 100-Driver dataset exhibits greater diversity in lighting conditions and camera perspectives, introducing additional challenges for accurate distraction recognition.
Example images from the SFD3 and 100-Driver datasets are shown in Figure 2 and Figure 3, respectively. A detailed comparison of class distributions is provided in Table 1.
Figure 2. Example images of ten classes from SFD3.
Figure 3. Example images of ten classes from 100-Driver.
Table 1. Driver behavior classes and datasets.

4.2. Experiment Setting

During implementation, the model was built with the deep learning framework PyTorch 2.5.1, and all experiments were conducted on Linux. The SFD3 experiments used an Intel Core i9-10980XE CPU @ 3.00 GHz and an NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM, both sourced from Chongqing, China. The 100-Driver experiments were performed on a system equipped with an Intel Xeon Platinum 8481C CPU and an NVIDIA GeForce RTX 4090 D GPU with 24 GB of VRAM, also sourced from Chongqing, China. Input images were uniformly resized to 224 × 224 pixels, with a batch size of 64 and an initial learning rate of $1 \times 10^{-6}$. The AdamW optimizer was adopted for training. All results were obtained after training for 50 epochs, with data augmentation techniques applied. Specific hyperparameter configurations are detailed in Table 2.
Table 2. Hyperparameter configuration.
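For reference, the following is a minimal training configuration consistent with the settings above (224 × 224 inputs, batch size 64, AdamW with an initial learning rate of 1 × 10⁻⁶, cross-entropy loss, 50 epochs). The augmentation transforms, dataset path, and the stand-in model are assumptions; Table 2 lists the exact hyperparameters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),                 # uniform input size
    transforms.ColorJitter(0.2, 0.2, 0.2),         # example augmentation (assumption)
    transforms.RandomRotation(10),                 # example augmentation (assumption);
                                                   # horizontal flips are deliberately avoided,
                                                   # since they would swap left-/right-hand classes
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("data/sfd3/train", transform=transform)   # placeholder path
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=8)

model = models.resnet18(num_classes=10)            # stand-in backbone; substitute RMTSE here
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```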

4.3. Experiment Results on SFD3

The SFD3 dataset was randomly partitioned into 80% for training and 20% for testing. The proposed model achieved a top-1 accuracy of 99.82% on the SFD3 dataset. Comparative analyses were conducted against various models employing CNN, LSTM, and ViT methodologies. Detailed performance metrics are presented in Table 3.
Table 3. Performance evaluation of various models on SFD3.
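One way to reproduce the 80/20 random image-level partition described above; the dataset path is a placeholder and the fixed seed is only for repeatability.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("data/sfd3/imgs/train",            # placeholder path
                               transform=transforms.ToTensor())
n_train = int(0.8 * len(dataset))                                 # 80% for training
train_set, test_set = random_split(
    dataset, [n_train, len(dataset) - n_train],
    generator=torch.Generator().manual_seed(0),                   # fixed seed for repeatability
)
```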
Figure 4 illustrates the confusion matrix of the proposed model on the SFD3 validation set. Notably, the model exhibits robust discrimination capabilities across all classes, demonstrating its strong potential for precise identification of driver distraction behaviors.
Figure 4. Confusion matrix of RMTSE on the SFD3 dataset. Only sporadic misclassifications are observed across a few classes.

4.4. Experiment Results on 100-Driver

The SFD3 dataset can be approximately regarded as a subset of the 100-Driver dataset in terms of driver behavior categories. Following the methodology of [33], we extracted 10 categories from the original 22 classes in 100-Driver to align with those in SFD3. Additionally, we utilized all daytime images from camera 4 (cam4), whose placement closely resembles the setup in SFD3. This approach introduces similarities in data categorization and feature distribution, facilitating comparative analysis with experiments on SFD3. The training and test sets were partitioned using the conventional 8:2 split ratio as specified in the 100-Driver study [30]. The proposed model achieved a 94.95% top-1 accuracy on the 100-Driver dataset. Comparative evaluations against multiple models demonstrate that RMTSE achieves significant advantages in accuracy, parameter count, and computational complexity over conventional ViT and traditional CNN-based methods. Compared to RMT-T, our proposed model achieves effective enhancement in recognition accuracy while introducing very few additional parameters. Detailed metrics are provided in Table 4.
Table 4. Performance evaluation of various models on 100-Driver.
In Figure 5, we present the confusion matrices of multiple models on the 100-Driver validation set. Analysis reveals that the proposed RMTSE model exhibits superior overall recognition capability across all categories compared to the other benchmarked models.
Figure 5. Confusion matrices of the selected models on the 100-Driver dataset. RMTSE demonstrates the highest recognition accuracy, while ResNet-101 and ViT-S exhibit higher confusion between c7 and c2, as well as c8 and c4. Inception-V3 shows the most severe misclassification. All models display significant confusion between c9 and c0.

4.5. Ablation Study

4.5.1. Ablation Study on SE-Net

As shown in Figure 6, compared to the original RMT-T without the SE-Net module, RMTSE exhibits faster convergence in training loss and achieves lower loss plateaus. The lower part of the figure illustrates the Top-1 test accuracy trajectories of RMTSE and RMT-T across training epochs. It is evident that after several initial epochs, RMTSE consistently achieves higher Top-1 test accuracy at equivalent epochs compared to RMT-T. These results validate that the SE-Net module effectively recalibrates features through its channel attention mechanism.
Figure 6. Comparison of training loss (top) and top-1 test accuracy (bottom) with and without the SE-Net module.

4.5.2. Ablation Study on Transfer Learning

Figure 7 presents a global comparison of training dynamics between the RMTSE model employing transfer learning (TL) and the model trained from scratch without pre-trained weights, evaluated through training loss and Top-1 test accuracy. Experimental results demonstrate that the TL-enabled model achieves convergence to 97.68% Top-1 test accuracy within 5 epochs. In contrast, the model without TL first exceeds 97% (reaching 97.13%) only at the 46th epoch. This indicates that the TL-enabled model bypasses the need to relearn elementary features and generic patterns. Instead, it leverages pre-trained knowledge of fundamental feature representations to directly assemble global characteristics of driver distraction behaviors. Compared to training from scratch, the TL approach significantly reduces training time while enhancing recognition accuracy for driver distraction behaviors.
Figure 7. Comparison of training loss (top) and top-1 test accuracy (bottom) with and without transfer learning (TL).

4.6. Model Visualization

To visualize the model’s attention and verify whether it focuses on semantically relevant image features, we employ Gradient-weighted Class Activation Mapping (Grad-CAM). Grad-CAM generates heatmaps that highlight the regions in input images that most strongly influence predictions for specific classes. Figure 8 illustrates the attention patterns of four distinct models on example images from categories c0 to c5 in the 100-Driver dataset. The six example images cover hand movements in different locations. For instance, in the c3 category (Texting with the left hand), ResNet-101’s attention shifts toward the central region rather than the critical left-hand area, while ViT-S disperses attention and erroneously captures features from the right hand. In contrast, RMTSE precisely localizes the left hand and mobile phone regions, aligning closely with the core characteristics of the behavior while maintaining spatial separation from irrelevant regions. Across other examples, RMTSE consistently focuses on key limb regions with high inter-class discriminative power. Moreover, its activation maps exhibit spatial continuity, enabling effective capture of local contextual information around critical features, thereby enhancing differentiation from other classes.
Figure 8. GradCAM visualization results of selected models across different scenarios, where each row corresponds to a different model. RMTSE consistently exhibits continuous and precise attention to critical features in all scenarios.
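For readers who wish to reproduce such visualizations, the following is a minimal hook-based Grad-CAM sketch; model and target_layer are assumptions (for RMTSE one would typically hook the feature map produced by the final stage).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Return a normalized (1, 1, H, W) heatmap for `image`, a (3, H, W) tensor."""
    feats, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.update(act=o))
    bh = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(act=go[0]))

    logits = model(image.unsqueeze(0))
    cls = int(logits.argmax(dim=1)) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, cls].backward()              # gradients of the class score w.r.t. the hooked feature map
    fh.remove()
    bh.remove()

    weights = grads["act"].mean(dim=(2, 3), keepdim=True)          # GAP over the gradient map
    cam = F.relu((weights * feats["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # normalized heatmap
```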

5. Limitation

Although the RMTSE model demonstrates superior recognition performance on both datasets, analysis of the confusion matrices (Figure 5) reveals that all tested models exhibit pronounced confusion between c0 (Safe driving) and c9 (Talking to passengers) in the 100-Driver dataset, a phenomenon not observed in the SFD3 dataset. Upon inspecting the training data, we identified that this discrepancy primarily stems from instances in the 100-Driver dataset where drivers in c9 lacked distinctive head-turning movements, a critical feature for distinguishing this class from others. Without this feature, the only discriminative characteristic between c9 and c0 becomes subtle differences such as slight mouth opening in certain frames. Such nuanced variations are inherently challenging for models to capture, resulting in substantial misclassification. A detailed comparative visualization is shown in Figure 9. Future work may address this limitation by integrating dedicated mechanisms for micro-expression detection to enhance discriminative capability.
Figure 9. Comparison of driver behaviors. From left to right: (a) c0: Safe driving; (b) an atypical c9 (Talking to passengers) image missing the key head-turning feature; (c) a standard c9 image. The absence of critical features and the resulting inter-image similarity are a primary cause of the severe confusion.
Furthermore, while the spatial-channel dual attention mechanism and hierarchical feature extraction architecture proposed in this study effectively enhance recognition accuracy, they remain constrained by the inherent mechanisms and characteristics of the ViT model, resulting in demanding computational resource requirements. As shown in Table 4, the RMTSE model still exhibits a computational complexity of 2.651 GFLOPs, which may pose compatibility challenges in scenarios with strict real-time requirements, such as in-vehicle embedded deployment. Although parameter efficiency has been improved through the SE-Net module, it remains necessary to strike a balance between accuracy and computational overhead in practical applications with limited edge computing resources. To address this limitation, knowledge distillation techniques could be investigated in future research to transfer knowledge from larger-scale models to compact variants, as sketched below. This methodology aims to preserve comparable recognition performance while substantially reducing model complexity, thereby enhancing deployment adaptability for resource-constrained in-vehicle embedded computing systems.
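A minimal sketch of a standard distillation objective of this kind, where a compact student matches the softened logits of an RMTSE teacher while also fitting the ground-truth labels; the temperature and weighting are placeholder values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions,
    # rescaled by T^2 so its gradient magnitude matches the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1),   # teacher is not updated
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)       # ground-truth supervision
    return alpha * soft + (1 - alpha) * hard
```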

6. Conclusions

The proposed RMTSE model achieves high-precision recognition of driver distraction behaviors through its spatial-channel dual attention mechanism and hierarchical feature extraction architecture. The incorporation of transfer learning further endows the method with rapid deployment capability and scalability, improves the recognition accuracy of driver distraction behavior, and reduces dependency on the data volume of driver distraction datasets. Extensive experiments on the SFD3 and 100-Driver datasets validate the model’s efficacy, achieving Top-1 accuracies of 99.82% and 94.95%, respectively, surpassing numerous state-of-the-art methods. The ablation study reveals that the SE-Net module enhances parameter efficiency by recalibrating features through channel-wise attention.

Author Contributions

Conceptualization, J.H. and Y.W.; methodology, J.H. and Y.W.; software, J.H., C.L. and W.Z.; validation, J.H. and Y.X.; formal analysis, J.H.; writing—original draft preparation, J.H.; writing—review and editing, J.H., Y.W., C.L., Y.X., H.L. and W.Z.; data curation, Y.X., H.L. and W.Z.; supervision, C.L., Y.X., H.L., W.Z. and Y.W.; funding acquisition, C.L. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Chongqing Natural Science Foundation Innovation and Development Joint Fund (Project NO. CSTB2024NSCQ-LZX0153).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request. All datasets used are publicly available.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Stavrinos, D.; Jones, J.L.; Garner, A.A.; Griffin, R.; Franklin, C.A.; Ball, D.; Welburn, S.C.; Ball, K.K.; Sisiopiku, V.P.; Fine, P.R. Impact of distracted driving on safety and traffic flow. Accid. Anal. Prev. 2013, 61, 63–70. [Google Scholar] [CrossRef] [PubMed]
  2. World Health Organization. Global Status Report on Road Safety 2023: Summary; World Health Organization: Geneva, Switzerland, 2023. [Google Scholar]
  3. Tan, D.; Tian, W.; Wang, C.; Chen, L.; Xiong, L. Driver distraction behavior recognition for autonomous driving: Approaches, datasets and challenges. IEEE Trans. Intell. Veh. 2024. [Google Scholar] [CrossRef]
  4. Duan, C.; Gong, Y.; Liao, J.; Zhang, M.; Cao, L. FRNet: DCNN for real-time distracted driving detection toward embedded deployment. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9835–9848. [Google Scholar] [CrossRef]
  5. Wang, Z.; Yao, L. Recognition of Distracted Driving Behavior Based on Improved Bi-LSTM Model and Attention Mechanism. IEEE Access 2024, 12, 67711–67725. [Google Scholar] [CrossRef]
  6. Mittal, H.; Verma, B. CAT-CapsNet: A convolutional and attention based capsule network to detect the driver’s distraction. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9561–9570. [Google Scholar] [CrossRef]
  7. Lu, M.; Hu, Y.; Lu, X. Driver action recognition using deformable and dilated faster R-CNN with optimized region proposals. Appl. Intell. 2020, 50, 1100–1111. [Google Scholar] [CrossRef]
  8. Ma, Y.; Wang, Z. ViT-DD: Multi-task vision transformer for semi-supervised driver distraction detection. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024; pp. 417–423. [Google Scholar]
  9. Chen, H.; Liu, H.; Feng, X.; Chen, H. Distracted driving recognition using vision transformer for human-machine co-driving. In Proceedings of the 2021 5th CAA International Conference on Vehicular Control and Intelligence (CVCI), Tianjin, China, 29–31 October 2021; pp. 1–7. [Google Scholar]
  10. Li, N.; Jain, J.J.; Busso, C. Modeling of driver behavior in real world scenarios using multiple noninvasive sensors. IEEE Trans. Multimed. 2013, 15, 1213–1225. [Google Scholar] [CrossRef]
  11. Seshadri, K.; Juefei-Xu, F.; Pal, D.K.; Savvides, M.; Thor, C.P. Driver Cell Phone Usage Detection on Strategic Highway Research Program (SHRP2) Face View Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  12. SHRP2. Available online: http://www.trb.org/StrategicHighwayResearchProgram2SHRP2/Blank2.aspx (accessed on 24 February 2025).
  13. Alkinani, M.H.; Khan, W.Z.; Arshad, Q. Detecting human driver inattentive and aggressive driving behavior using deep learning: Recent advances, requirements and open challenges. IEEE Access 2020, 8, 105008–105030. [Google Scholar] [CrossRef]
  14. Qu, F.; Dang, N.; Furht, B.; Nojoumian, M. Comprehensive study of driver behavior monitoring systems using computer vision and machine learning techniques. J. Big Data 2024, 11, 32. [Google Scholar] [CrossRef]
  15. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  17. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  18. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  21. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  23. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  24. Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. Rmt: Retentive networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5641–5651. [Google Scholar]
  25. Tang, X.; Chen, Y.; Ma, Y.; Yang, W.; Zhou, H.; Huang, J. A lightweight model combining convolutional neural network and Transformer for driver distraction recognition. Eng. Appl. Artif. Intell. 2024, 132, 107910. [Google Scholar] [CrossRef]
  26. Koay, H.V.; Chuah, J.H.; Chow, C.O. Shifted-window hierarchical vision transformer for distracted driver detection. In Proceedings of the 2021 IEEE Region 10 Symposium (TENSYMP), Jeju, Republic of Korea, 23–25 August 2021; pp. 1–7. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  29. SFD3. Available online: https://www.kaggle.com/competitions/state-farm-distracted-driver-detection (accessed on 12 January 2025).
  30. Wang, J.; Li, W.; Li, F.; Zhang, J.; Wu, Z.; Zhong, Z.; Sebe, N. 100-driver: A large-scale, diverse dataset for distracted driver classification. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7061–7072. [Google Scholar] [CrossRef]
  31. Behera, A.; Keidel, A.H. Latent body-pose guided densenet for recognizing driver’s fine-grained secondary activities. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 24–30 November 2018; pp. 1–6. [Google Scholar]
  32. Dhakate, K.R.; Dash, R. Distracted driver detection using stacking ensemble. In Proceedings of the 2020 IEEE International Students’ Conference on Electrical, Electronics and Computer Science (SCEECS), Bhopal, India, 22–23 February 2020; pp. 1–5. [Google Scholar]
  33. Li, Z.; Zhao, X.; Wu, F.; Chen, D.; Wang, C. A Lightweight and Efficient Distracted Driver Detection Model Fusing Convolutional Neural Network and Vision Transformer. IEEE Trans. Intell. Transp. Syst. 2024. [Google Scholar] [CrossRef]
  34. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
