1. Introduction
In the increasingly congested and complex road environment, the frequency of traffic accidents and concerns about road safety are constantly escalating. During driving, drivers often engage in distracting behaviors such as making phone calls, drinking water, eating, and talking to passengers. These behaviors lead to slower reaction times, decreased attention, and reduced perception of the environment, thereby increasing the probability of traffic accidents. Therefore, driver distraction often constitutes a major contributing factor to traffic accidents [1]. As a response, research on the detection of drivers’ distraction behaviors has garnered significant attention as a means of enhancing road safety and mitigating accidents [2]. Research in driver distraction detection aims to develop advanced technological approaches that can monitor and identify deviations from normal driver behavior, enabling timely alerts or warnings to drivers and reducing potential traffic safety risks. This research holds paramount significance for accident prevention, improved road safety, and the protection of passengers, pedestrians, and drivers [3]. Moreover, through rigorous data analysis, we can gain a deeper understanding of driver behavior patterns, accident causes, and trends, providing strong support for improving traffic safety policies and driver training. In general, driver distraction detection is of great significance for improving road safety, preventing traffic accidents, and promoting the development of intelligent transportation systems [4]. By reducing driver distraction, we can create a safer, more efficient, and more sustainable road transportation environment [5].
Intelligent driving is a driving paradigm that integrates advanced technologies such as artificial intelligence, machine learning, sensor systems, and connectivity to enhance the safety, efficiency, and autonomy of vehicles. It aims to improve the driving experience and reduce human intervention in various aspects of vehicle operation [6]. Driver distraction detection is one of the most important research topics in the field of intelligent driving. In the early stages, researchers proposed many driver distraction detection methods based on traditional machine learning algorithms to improve road safety [7,8]. These methods are usually trained on samples of distracted and normal driving behavior and then use the learned patterns to detect and classify distraction behaviors. They usually offer good interpretability and can reveal the key features behind abnormal judgments [9]. Moreover, they typically do not require a large amount of training data, making them better suited for situations with very limited data. However, traditional machine learning-based driver distraction detection methods also have shortcomings. Firstly, their feature selection and extraction process is relatively difficult, requires domain expertise, and struggles to fully capture the driver’s behavior patterns. In addition, these methods generalize poorly to complex and ever-changing driving environments, making it difficult for them to handle situations not seen during training.
Although traditional machine learning-based methods have certain advantages in driver distraction detection tasks, they are limited in complex scenarios and under higher accuracy requirements. With the development of deep learning, many methods based on deep neural networks have emerged to overcome these limitations [10,11,12]. These deep learning-based methods utilize the powerful feature learning ability of deep neural networks to automatically learn higher-level, abstract feature representations from the training data, achieving driver distraction detection with high accuracy. Deep learning-based methods can perform end-to-end learning, from raw data to the final distraction behavior classification, reducing the need for feature engineering. Moreover, these methods have strong generalization ability and can be applied to various driving scenarios. The application of deep learning methods has greatly promoted the development of the field of driver distraction detection.
These deep learning methods can achieve good performance in driver distraction detection [13]. However, they still have some limitations. On the one hand, most methods achieve classification mainly through the constraint of the cross-entropy loss. However, the cross-entropy loss can only produce a classification hyperplane that separates samples of different categories; it cannot produce highly discriminative features [14]. Therefore, the classification accuracy of these methods is difficult to improve further. How to learn highly discriminative features and further improve classification accuracy is an important open challenge [15]. In addition, the computational complexity of deep learning models is relatively high, and they require a large amount of training data [16]. However, current datasets in the field of driver distraction detection are very limited in size, making models prone to severe overfitting during training. How to overcome overfitting is another important challenge for deep learning-based methods.
To address the aforementioned issues, in this paper we propose a driver distraction detection method based on the Swin Transformer and a highly discriminative feature learning strategy (ST-HDFL). Thanks to its large receptive field and powerful feature learning ability, the Swin Transformer has outperformed CNNs in many studies [17,18,19]; we therefore adopt it for feature extraction. However, it is difficult to obtain highly discriminative features through the constraint of a classification loss alone. Therefore, inspired by [14], we propose a highly discriminative feature learning strategy, which consists of the constraint of a sample-center distance loss (SC loss) and a center vector shift process. Firstly, we initialize a center vector for each class of samples; we then reduce the intra-class distance of same-class samples by minimizing the distance between samples and their corresponding center vectors in the feature space, and increase the inter-class distance of different-class samples through the center vector shift process. In addition, because the amount of data in existing public driver distraction detection datasets is limited, we adopt data augmentation based on image transformations to alleviate overfitting. Unlike other driver distraction detection methods, our method exploits the powerful feature learning ability of the Swin Transformer to extract features from the input images and further improves the discrimination of different-class samples in the feature space through the constraint of the SC loss and the center vector shift strategy, thereby improving the accuracy of driver distraction detection. To evaluate the effectiveness of the proposed driver distraction detection method based on the ST-HDFL model, we have conducted extensive experiments on public datasets.
The contributions of this paper are summarized as follows:
- Due to the powerful image feature learning ability of the Swin Transformer, it is introduced in this paper to extract more representative features from the input images.
- A novel highly discriminative feature learning strategy based on the SC loss and a center vector shift process is proposed.
- To evaluate the effectiveness of the proposed driver distraction detection method, extensive experiments have been conducted on well-known public driver distraction detection datasets (AUC-DDD and State-Farm).
The rest of this paper is structured as follows. Section 2 reviews the related works. Section 3 provides a detailed introduction to the proposed driver distraction detection method based on the ST-HDFL model. The experimental datasets, data processing methods, and implementation details are introduced in Section 4. Then, the experimental results are presented and analyzed in Section 5. Finally, the research of this paper is summarized in Section 6.
3. Methodology
In this paper, we propose a driver distraction detection method based on the ST-HDFL model. The algorithm framework of ST-HDFL is shown in Figure 1. Firstly, the input image is divided into multiple small patches. Then, each patch is mapped into a feature vector through the linear embedding module. Next, a position bias is added to each feature vector based on the position of the corresponding patch in the input image. Then, the Swin Transformer encoder extracts features from the position-encoded feature vector sequence. Finally, a fully connected classifier classifies the obtained feature vectors. Before training, a center vector is initialized for each class of samples. During training, in addition to the classification loss, we add the constraint of the SC loss to reduce the intra-class distance between same-class samples. In each training iteration, the center vectors are updated through the center vector shift strategy to increase the inter-class distance between different classes of samples in the feature space. In this section, we provide a detailed introduction to feature extraction through the Swin Transformer and the highly discriminative feature learning strategy, respectively.
3.1. Feature Extraction through Swin Transformer
The Swin Transformer is a Transformer-based deep learning model with excellent performance in visual tasks. The overall architecture of the tiny version of the Swin Transformer is shown in Figure 2. Unlike the Vision Transformer, the Swin Transformer combines high accuracy with computational efficiency, and it has been used as the backbone of many visual model architectures. The Swin Transformer introduces two key concepts to solve the problems faced by the original Vision Transformer: hierarchical feature maps and shifted window attention.
- (1) Hierarchical feature maps
The first significant difference from the Vision Transformer is that the Swin Transformer builds hierarchical feature maps by gradually merging patches and downsampling, which allows it to better learn features at different scales. In addition, the Swin Transformer uses patch merging, a non-convolutional downsampling technique, which effectively reduces the resolution of the feature map and the computational complexity.
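The reshaping idea behind patch merging can be sketched in a few lines of numpy. This is only an illustration: the layer normalization used in the real operation is omitted, and the reduction weights here are random placeholders for learned parameters.

```python
import numpy as np

def patch_merging(x, w):
    """Merge each 2x2 neighborhood of patches: (H, W, C) -> (H/2, W/2, 2C).

    x: feature map of shape (H, W, C); w: linear reduction weights (4C, 2C).
    """
    # gather the four patches of every 2x2 block and concatenate their channels
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )                    # shape (H/2, W/2, 4C)
    return merged @ w    # linear reduction from 4C to 2C channels

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 96))          # toy feature map
w = rng.normal(size=(4 * 96, 192))       # stand-in for the learned reduction
y = patch_merging(x, w)                  # half the resolution, double the channels
```

Each merging stage therefore halves the spatial resolution while doubling the channel dimension, which is what produces the hierarchical feature maps.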
- (2) Shifted window attention
The standard MSA used in the Vision Transformer performs global self-attention, computing attention weights between every patch and all other patches. This leads to complexity quadratic in the number of patches, making it unsuitable for high-resolution images. To address this issue, the Swin Transformer uses a window-based MSA method. A window is simply a set of patches, and attention is computed only within each window. Because the window size is fixed throughout the network, the complexity of window-based MSA is linear in the number of patches, a significant improvement over the quadratic complexity of standard MSA.
However, window-based MSA has a significant drawback: restricting self-attention to each window limits the modeling capacity of the network. To address this issue, the Swin Transformer applies a shifted window MSA (SW-MSA) module after the W-MSA module. After the shift operation, a window may contain non-adjacent patches from the original feature map, so a mask is used in the computation to limit self-attention to adjacent patches. This shifted window method introduces important cross-connections between windows, which has been shown to improve network performance.
The Swin Transformer replaces the multi-head self-attention (MSA) module of the Vision Transformer with window MSA (W-MSA) and shifted window MSA (SW-MSA). The structure of the Swin Transformer block is shown in Figure 2. Each Swin Transformer block consists of two subunits, each of which consists of a normalization layer and an attention module, followed by another normalization layer and an MLP layer. The first subunit uses the W-MSA module, while the second uses the SW-MSA module. The calculation process of each Swin Transformer block is as follows:
$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1},$$
$$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l},$$
$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1},$$
where $\hat{z}^{l}$, $z^{l}$, $\hat{z}^{l+1}$, and $z^{l+1}$ denote the intermediate results of the calculation process, respectively, and $\text{LN}(\cdot)$ denotes layer normalization.
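The Swin Transformer block's calculation, a W-MSA subunit followed by an SW-MSA subunit, each with pre-normalization, a residual connection, and an MLP, can be sketched structurally in numpy. The attention and MLP modules are passed in as callables (identity stand-ins in the usage below), so this shows the data flow only, not a working attention implementation.

```python
import numpy as np

def layer_norm(x):
    """Normalize each token vector to zero mean and unit variance."""
    m = x.mean(-1, keepdims=True)
    s = x.std(-1, keepdims=True) + 1e-5
    return (x - m) / s

def swin_block(z, w_msa, sw_msa, mlp):
    """Two residual subunits: W-MSA then SW-MSA, each followed by an MLP."""
    z = w_msa(layer_norm(z)) + z      # first subunit, attention part
    z = mlp(layer_norm(z)) + z        # first subunit, MLP part
    z = sw_msa(layer_norm(z)) + z     # second subunit, attention part
    z = mlp(layer_norm(z)) + z        # second subunit, MLP part
    return z

rng = np.random.default_rng(0)
z = rng.normal(size=(49, 96))         # e.g. 49 tokens of one 7x7 window
identity = lambda t: t                # placeholder attention/MLP modules
out = swin_block(z, identity, identity, identity)
```

Note that the token shape is preserved throughout the block, which is why several blocks can be stacked per stage.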
Compared with convolutional neural networks, the Transformer has the advantages of a large receptive field and high computational efficiency, and it has very broad application prospects. Moreover, in computer vision tasks, the Swin Transformer makes several optimizations and improvements over the Vision Transformer, significantly improving both feature learning ability and computational efficiency. Therefore, the Swin Transformer is adopted for feature extraction from the input image in the driver distraction detection method proposed in this paper. Firstly, the input image $X \in \mathbb{R}^{H \times W \times 3}$ is split into non-overlapping patches through a patch splitting module. Then, a linear embedding layer projects each patch to a vector $x_i$, which is treated as a token. Next, in order to fully utilize the relative positional relationships between different patches, a relative position bias is added to each token as follows:
$$z_0 = [x_{class}; x_1; x_2; \ldots; x_N] + E_{pos},$$
where $x_{class}$ is the class token, and $E_{pos}$ denotes the position bias of each token. Finally, several Swin Transformer blocks are applied for feature learning from all the tokens (as shown in Figure 1), and the class token of $z_k$ is taken as the feature vector $f$, where $k$ is the number of Swin Transformer blocks.
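The tokenization steps described above can be illustrated with a toy numpy sketch. The patch size, embedding dimension, and weights below are hypothetical, and the position bias and class token are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
P, D = 4, 96                                    # patch size and embedding dim (toy values)
img = rng.normal(size=(224, 224, 3))            # stand-in for an input image

# split into non-overlapping P x P patches and flatten each one
patches = img.reshape(224 // P, P, 224 // P, P, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * 3)        # (3136, 48): one row per patch

# linear embedding: project every flattened patch to a D-dim token
W_embed = rng.normal(size=(P * P * 3, D))
tokens = patches @ W_embed                      # (3136, 96)

# prepend a class token and add a position bias to every token
cls = rng.normal(size=(1, D))
E_pos = rng.normal(size=(3137, D))
z0 = np.concatenate([cls, tokens]) + E_pos      # input to the Swin blocks
```

After the stacked blocks, the row of the final sequence corresponding to the class token would serve as the feature vector passed to the classifier.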
3.2. Highly Discriminative Feature Learning Strategy
Optimizing the classification model solely by minimizing the classification loss only yields a classification hyperplane that divides the input samples into different categories; it cannot produce highly discriminative features. Therefore, the classification performance is limited. To further improve the accuracy of the proposed driver distraction detection model, a highly discriminative feature learning strategy is proposed in this paper, which includes a center vector initialization process, the constraint of the SC loss, and a center vector shift process.
- (1) Center vector initialization process.
Before the training process, a center vector needs to be randomly initialized for each class of samples. During the initialization process, the number of center vectors is the same as the number of sample categories in the experimental datasets, and the dimensions of each center vector are consistent with the feature vectors extracted by Swin Transformer.
- (2) The constraint of SC loss.
In order to reduce the intra-class distance between same-class samples in the feature space, the constraint of the SC loss is introduced into the proposed ST-HDFL model. The SC loss is the average distance between the feature vectors and their corresponding center vectors. During training, we promote the aggregation of same-class samples around their corresponding center vectors by minimizing the SC loss (as shown in Figure 3a), which greatly reduces the intra-class distance of same-class samples in the feature space. The SC loss is calculated as follows:
$$L_{SC} = \frac{1}{B} \sum_{i=1}^{B} \left\| f_i^t - c_{y_i}^t \right\|_2^2,$$
where $f_i^t$ denotes the feature vector of the $i$-th sample in the $t$-th iteration; $B$ denotes the number of samples in a batch; $y_i$ is the label of the $i$-th sample in the training batch; and $c_{y_i}^t$ represents the central vector of the $y_i$-th class of samples during the $t$-th iteration.
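A minimal numpy sketch of the SC loss described above, assuming the distance is the squared Euclidean distance averaged over the batch; the feature dimension and the random features below are placeholders for Swin Transformer outputs.

```python
import numpy as np

def sc_loss(feats, labels, centers):
    """Average squared distance between each feature and its class center."""
    diffs = feats - centers[labels]             # (B, D): f_i - c_{y_i}
    return np.mean(np.sum(diffs ** 2, axis=1))  # mean over the batch

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 768))            # one randomly initialized center per class
feats = rng.normal(size=(32, 768))              # stand-in for a batch of features
labels = rng.integers(0, 10, size=32)
loss = sc_loss(feats, labels, centers)          # added to the classification loss
```

Minimizing this term pulls every feature toward its class center, which is what shrinks the intra-class distance.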
- (3) Center vector shift process.
Although the constraint of the SC loss can promote each class of samples to cluster near its corresponding center vector, the distance between different center vectors can remain insufficient, so the discrimination among different classes of samples is usually not obvious. To increase the inter-class distance of different classes of samples in the feature space, the center vector shift process is applied in the proposed ST-HDFL model. During the center vector shift process, we first calculate the average vector $\bar{c}^t$ of all center vectors, and then shift each center vector along the direction of $c_i^t - \bar{c}^t$ by a certain step size (as shown in Figure 3b). The central vector shift process can be described as
$$\bar{c}^t = \frac{1}{C} \sum_{i=1}^{C} c_i^t,$$
$$c_i^{t+1} = c_i^t + \alpha \left( c_i^t - \bar{c}^t \right),$$
where $\bar{c}^t$ is the average vector of all central vectors during the $t$-th iteration; $C$ is the number of sample categories in the experimental datasets; $\alpha$ is the step size of the central vector shift process; and $c_i^t$ and $c_i^{t+1}$ are the central vectors of samples with the label $i$ before and after the central vector shift process in the $t$-th iteration, respectively.
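Assuming the shift moves each center away from the mean of all centers by a step size alpha, as described above, the update can be sketched as follows; the class count, feature dimension, and step size are toy values.

```python
import numpy as np

def shift_centers(centers, alpha):
    """Shift every center away from the mean of all centers by step alpha."""
    mean_c = centers.mean(axis=0, keepdims=True)   # average of all center vectors
    return centers + alpha * (centers - mean_c)    # move each center outward

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 8))                 # 10 classes, 8-dim toy features
new_centers = shift_centers(centers, 0.5)
```

A useful property of this update is that every pairwise distance between centers scales by exactly (1 + alpha), since the mean term cancels when two shifted centers are subtracted, so one step uniformly enlarges the inter-class margins.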