1. Introduction
In recent years, as the number of vehicles continues to increase, traffic congestion on the roads has gradually worsened, and traffic accidents occur frequently. Globally, the number of deaths due to road traffic accidents has reached 1.35 million per year, and approximately 50 million people are injured in road traffic accidents every year [1]. These accidents mainly include collisions between vehicles, vehicle–pedestrian collisions, and vehicle–non-motorized vehicle collisions. They are mainly caused by speeding, fatigued driving, and distracted driving, although some accidents result from external factors such as adverse weather conditions and poorly maintained roads. In light of the alarming data on traffic accident casualties, engineers have begun to seek new solutions.
An advanced driver assistance system (ADAS) is considered one of the most crucial ways to potentially decrease the risk of accidents and enhance road safety. An ADAS is an in-vehicle intelligent technology that typically utilizes sensors and cameras to gather vehicle information and provide warnings and recommendations to the driver. However, the ADASs supplied with current vehicles offer alerts based only on the vehicle’s operational status. The lack of environmental contextual information and driver behavioral data can result in delayed warnings that do not give drivers sufficient time to react [2]. If an ADAS can predict the driver’s intentions a few seconds in advance, the system can prepare before the driver maneuvers the vehicle, assisting the driver in controlling the vehicle in time to avoid a collision.
In recent years, many studies on driving have emerged [3,4,5]. These studies involved modeling using a variety of data, including vehicle dynamics data, driver state data, and external road environment data [6,7,8,9,10]. Many studies used both driver state data and external road environment data [11,12,13,14] and found that combining the two types of data is more effective than using either type alone. Although these studies used multi-feature splicing and demonstrated its effectiveness, simple feature splicing does not handle the correlation between multiple features very well [15,16]. In addition, some methods use a 3D CNN to deal with time series data or data with a temporal dimension, such as video. However, this approach results in an excessive number of model parameters, which hinders its performance in practical applications.
Therefore, research on driver steering intention recognition needs to reduce the number of model parameters while maintaining accuracy in order to develop a more efficient recognition system. In this paper, an end-to-end dual-branch network (EDNet) is proposed, which integrates in-cockpit and out-of-cockpit data based on driver cognition. The main contributions of our paper are as follows:
For in-cockpit video data, we propose a novel driver intent feature extractor called Atrous-3DResNet 50 (A-3DResNet 50). It incorporates atrous convolution in the initial stage of 3DResNet to reduce the model’s complexity while capturing a broader range of driver behaviors.
To effectively capture long-term dependencies in time series data, we designed the depthwise-separable max-pooling (DSMax) module and integrated it with ConvLSTM to form the ConvLSTM-4DSMax road environment feature extractor for vehicle operations. This enhancement improves the model’s processing of time series information, closely aligning with the requirements of the research task for such data.
Aiming to integrate the features of the video data both inside and outside the cockpit, we propose a multi-feature fusion strategy and design a feature fusion module called dynamic combined-feature attention fusion (D-CAF) based on the attention mechanism. This method effectively integrates various features from inside and outside the cockpit to enhance the accuracy of the model in classifying the driver’s steering intention.
The rest of the paper is organized as follows:
Section 2 reviews previous related work.
Section 3 explains the proposed methods and modules in detail.
Section 4 presents the dataset and discusses the experimental results.
Section 5 provides the conclusions of the paper, as well as suggestions for future work.
2. Related Work
Driver steering intention is a cognitively driven task that results from the interaction of the driver, the road environment, and the vehicle. The driver perceives information from the external road environment through vision and hearing, and evaluates incoming information before formulating intentions and making decisions.
Figure 1 displays the cognitive process of forming the driver’s steering intention.
In the study of driver steering intention, commonly used algorithms mainly include generative models, discriminative models, and deep learning methods. The task of recognizing driver steering intention is fundamentally a sequence prediction problem based on a time window. Driving behavior data within a specific time range is typically analyzed to predict and identify the driver’s steering intention. These driving behavior data cover the road environment information, vehicle motion state, and driver behavior.
Commonly used generative models include the hidden Markov model (HMM) and dynamic Bayesian network (DBN). Jain et al. [6] proposed an autoregressive input–output-based HMM (AIO-HMM) to predict the driver’s intention by utilizing driver behavior and road environmental information. He et al. [17] employed a dynamic Bayesian network to recognize the lane-changing behavior of surrounding vehicles in a highway scenario. They predicted the vehicle’s motion state and generated a trajectory based on this analysis. He et al. [18] designed a double-layer HMM structure, where the bottom layer multi-dimensional Gaussian HMM (MGHMM) integrates driver behavior with the vehicle motion state to identify driving behaviors. The recognition results are then forwarded to the upper layer multi-dimensional discrete HMM (MDHMM) to determine the final driving intention. Zabihi et al. [19] developed a maneuver prediction model using input–output HMM (IOHMM) to detect driver intentions by considering the vehicle’s motion state and driver behavior.
Discriminative models, such as an artificial neural network (ANN) and support vector machine (SVM), are also widely used in the field of driver steering intention recognition [20]. Kim et al. [21] utilized an ANN to analyze vehicle motion data, including the steering wheel angle, yaw rate, and throttle position, to classify the road conditions and predict the driver’s intention to change lanes. Leonhardt et al. [22] employed an ANN approach to predict the driver’s intention to change lanes based on the driver’s behavior and vehicle motion state. Morris et al. [23] developed a Bayesian extension of the popular SVM called a relevance vector machine (RVM) to distinguish between lane changes and lane keeping. On this basis, a real-time road prediction system was developed that is able to detect a driver’s intention to change lanes a few seconds before it occurs.
With the rapid development of deep learning technology, recurrent neural networks (RNNs) have become a research hotspot due to their advantages in dealing with time-dependent problems. A number of studies showed that long short-term memory (LSTM) networks outperform traditional machine learning methods and standard recurrent neural networks, demonstrating their importance and superiority in modeling driver behavior [24,25,26]. Zyner et al. [27] proposed a recurrent neural network-based prediction method using LIDAR to acquire vehicle motion data to predict driver intent at roundabout intersections. Zhou et al. [28] designed a cognitive fusion recurrent neural network (CF-RNN) that combines a cognitively driven model and a data-driven model; it consists of two LSTM branches that fuse road environment information and driver behavior information to predict driver intent. Kim et al. [29] proposed a trajectory prediction method using LSTM that utilizes vehicle motion data collected from motorways to analyze temporal behavior and predict the locations of surrounding vehicles. Transformer architectures have also achieved significant results in areas such as image classification, trajectory prediction, and time series prediction [30]. Gao et al. [31] proposed a transformer-based integrated model that includes a lane change intention prediction model and a lane change trajectory prediction model. This model was designed to jointly predict lane change intentions and trajectories for vehicles operating in a mixed traffic environment involving both human-driven and autonomous vehicles. Chen et al. [32] also proposed an intention-aware non-autoregressive transformer model with multi-attention learning based on the Transformer architecture to accurately predict multimodal vehicle trajectories.
In summary, the research field of driver steering intention covers a variety of algorithms, including generative models, discriminative models, and deep learning. These studies typically analyze multifaceted data, such as road environment information, the vehicle motion state, and driver behavior information, to accurately recognize and predict driver steering intentions.
Several studies utilized vehicle motion states to analyze driver steering intentions. However, vehicle operating states primarily stem from driving behaviors, which reflect how driver steering intentions manifest in the vehicle. Therefore, vehicle motion states are not suitable for directly recognizing driver steering intentions, especially when advance prediction is required. In addition, most previous studies were based on manually encoded features, and many of the models have large parameter counts, which conflicts with the lightweight requirements of real-vehicle applications.
Therefore, taking into account the driver’s cognitive perspective, this work utilized an attention mechanism to incorporate both in-cockpit features related to driver behavior and out-of-cockpit features concerning the road environment, with the aim of accurately identifying the driver’s steering intention.
3. Method
3.1. Overall Framework
This paper proposes a model that leverages both in-cockpit and out-of-cockpit video data to detect the driver’s steering intention prior to vehicle maneuvering. For the in-cockpit video, the driver intent feature extractor A-3DResNet 50 was constructed based on 3DResNet and atrous convolution. For the out-of-cockpit video, the video data are converted to optical flow images, and the DSMax module was designed based on depthwise separable convolution; the DSMax module is combined with ConvLSTM to construct the road environment feature extractor ConvLSTM-4DSMax. The D-CAF module is used to fuse these two types of features. Subsequently, a classifier was built, consisting of a fully connected layer and a softmax layer, to identify and predict five types of vehicle maneuvers: straight ahead (straight), left lane change (Lchange), left turn (Lturn), right lane change (Rchange), and right turn (Rturn).
The structure of EDNet is shown in Figure 2. Considering the need for a lightweight model and drawing inspiration from transfer learning, a freeze-training strategy was implemented for A-3DResNet 50 and ConvLSTM.
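To make the data flow concrete, a minimal PyTorch-style sketch of the dual-branch forward pass is given below. The class and attribute names (EDNetSketch, driver_branch, road_branch, d_caf, classifier) and the feature dimension are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class EDNetSketch(nn.Module):
    """Illustrative dual-branch wiring: in-cockpit branch + out-of-cockpit branch,
    fused by an attention module and classified into five maneuvers."""

    def __init__(self, driver_branch, road_branch, d_caf, feat_dim, num_classes=5):
        super().__init__()
        self.driver_branch = driver_branch   # e.g., A-3DResNet 50 (frozen)
        self.road_branch = road_branch       # e.g., ConvLSTM-4DSMax (ConvLSTM frozen)
        self.d_caf = d_caf                   # dynamic combined-feature attention fusion
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, num_classes),  # fully connected layer
            nn.Softmax(dim=1),                 # softmax over the five maneuvers
        )

    def forward(self, in_cockpit_clip, out_cockpit_flow):
        in_feat = self.driver_branch(in_cockpit_clip)   # driver intent features
        out_feat = self.road_branch(out_cockpit_flow)   # road environment features
        fused = self.d_caf(in_feat, out_feat)           # attention-weighted fusion
        return self.classifier(fused)                   # probabilities for the 5 maneuvers
```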
3.2. Driver Intent Feature Extractor
The 3DResNet excels in video data processing, especially for human action recognition [33]. Its ability to handle video sequences of varying lengths and sizes makes it ideal for cockpit video data processing. In deep learning, several 3DResNet variants with different architectural depths are commonly used, including 3DResNet 18, 3DResNet 34, 3DResNet 50, 3DResNet 101, and 3DResNet 152. Each variant has distinct strengths and weaknesses, whose significance depends on the task at hand.
The characterization and learning capabilities of the shallow 3DResNet 18 network are limited; its comparatively low learnable feature complexity makes it less effective in handling complex, high-dimensional data. On the other hand, the deep 3DResNet 152 network has much higher model complexity and computational resource needs, which increases hardware demands and slows the training and inference processes.
Hence, to strike a balance between the representation capacity, model complexity, and computational resource demands, we focused on the 3DResNet 34, 3DResNet 50, and 3DResNet 101 variants from the 3DResNet series. These networks offer significant representation capabilities while being relatively straightforward to train and requiring moderate computational resources.
Figure 3 displays the structures of 3DResNet 34, 3DResNet 50, and 3DResNet 101, which share a similar structure overall and are composed of five stages. Stage 1 incorporates convolutional and max-pooling layers for the spatial feature extraction and fusion of 3D data. Stages 2 to 5 include a varying number of residual blocks. The grey module represents the basic block, which is a residual block commonly employed in shallower ResNet. The blue module denotes the bottleneck block, which is utilized in deeper ResNet to reduce parameters, thereby effectively mitigating the model’s demand for graphics memory and computational resources.
Although 3DResNet works well for processing 3D data, it still has difficulty processing large video frames and capturing long-term temporal information. Thus, we built the A-3DResNet family of networks for driver intent feature extraction inside the cockpit by introducing atrous convolution in stage 1 of 3DResNet.
Atrous convolutional operations expand the receptive field without increasing the feature map size. This enhances the capture of temporal and spatial information, preventing information loss. Two key considerations underpin this design:
Enhancing feature extraction: Atrous convolutions play a vital role in integrating both local and global information. Incorporating atrous convolutions in the initial stage of 3D ResNet enhances the feature extraction efficiency and reduces the information loss.
Reducing model complexity: Introducing atrous convolution in the early stages of the model can limit the number of network parameters and computational complexity, thus preventing the addition of excessive computational burden in the subsequent stages.
Applied to the task of driver steering intention recognition, we utilized Atrous-3DResNet 50 (A-3DResNet 50) as a driver intent feature extractor within the cockpit. The model takes in 16 frames per second that are uniformly sampled from the video, with a frame image size of 112 × 112. The structure of A-3DResNet 50 is shown on the left side of Figure 2.
Table 1 presents the parameter information for each module in the model. In the output size column, the first number of every line indicates the feature map channel count, and the remaining ones indicate the feature map spatial dimensions.
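As an illustration of the stage-1 modification, the sketch below replaces the usual stride-2 stem convolution of a 3D ResNet with a dilated (atrous) 3D convolution. The kernel size, dilation rate, and channel counts are assumptions for illustration, not the exact A-3DResNet 50 configuration from Table 1.

```python
import torch
import torch.nn as nn

class AtrousStem3D(nn.Module):
    """Stage-1 stem with atrous (dilated) 3D convolution: a larger receptive field
    without enlarging the kernel or shrinking the feature map further."""

    def __init__(self, in_ch=3, out_ch=64, dilation=2):
        super().__init__()
        self.conv = nn.Conv3d(
            in_ch, out_ch, kernel_size=(3, 7, 7), stride=(1, 2, 2),
            padding=(dilation, 3 * dilation, 3 * dilation),
            dilation=dilation, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                      # x: (B, 3, 16, 112, 112)
        return self.pool(self.relu(self.bn(self.conv(x))))
```

The dilated kernel covers a wider temporal and spatial neighborhood at the same parameter cost as a standard convolution, which is the intuition behind placing it in stage 1.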
3.3. Road Environment Feature Extractor
Convolutional LSTM (ConvLSTM) is a deep learning model that combines CNN and RNN [34]. It is frequently employed for handling spatiotemporal data in the format of data frames. We designed the DSMax module based on depthwise separable convolution. ConvLSTM and DSMax modules were combined to create ConvLSTM-4DSMax, which is a road environment feature extractor outside the cockpit.
3.3.1. Optical Flow Prediction
Optical flow characterizes the changes in the image and captures the motion information of the target in successive frames of the video. Therefore, we used FlowNet 2.0 [35] to generate optical flow maps containing information about the relative motion between vehicles and other traffic participants as input into ConvLSTM.
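The preprocessing step can be illustrated with a simple dense optical flow routine. The paper uses FlowNet 2.0; the sketch below substitutes OpenCV’s Farneback estimator as a lightweight stand-in to show the shape of the resulting flow maps, and the parameter values are illustrative.

```python
import cv2
import numpy as np

def dense_flow_frames(frames):
    """Compute dense optical flow between consecutive frames (BGR images).
    Stand-in for FlowNet 2.0 used in the paper; Farneback parameters are illustrative."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)            # (H, W, 2): horizontal and vertical displacement
        prev = nxt
    return np.stack(flows)            # (T-1, H, W, 2); frames are later resized to 112 x 176
```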
ConvLSTM combines the efficient high-dimensional data extraction capabilities of a CNN with the memory function of LSTM, making it suitable for processing complex temporal data, such as speech and video. Its operation is encapsulated by the following equations:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t \oplus W_{hi} * H_{t-1} \oplus W_{ci} \odot C_{t-1} \oplus b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t \oplus W_{hf} * H_{t-1} \oplus W_{cf} \odot C_{t-1} \oplus b_f\right) \\
C_t &= f_t \odot C_{t-1} \oplus i_t \odot \tanh\left(W_{xc} * X_t \oplus W_{hc} * H_{t-1} \oplus b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t \oplus W_{ho} * H_{t-1} \oplus W_{co} \odot C_t \oplus b_o\right) \\
H_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
$$

where $i_t$ represents the input gate, $f_t$ represents the forget gate, $C_t$ represents the cell output, $o_t$ represents the output gate, $H_t$ represents the hidden state, $\sigma$ represents the sigmoid function, $*$ represents the convolution operation, $\odot$ represents element-wise multiplication, $\oplus$ represents element-wise addition, and $W$ and $b$ denote the corresponding convolution kernels and bias terms.
Figure 4 displays the structure of ConvLSTM. Its input is five frames, which are obtained by uniformly sampling from the video data, and the input size of each frame is 112 × 176.
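A minimal ConvLSTM cell following the equations above can be sketched in PyTorch as follows. The hidden width and kernel size are illustrative, the four gates are computed with one joint convolution for brevity, and the peephole terms ($W_{c\cdot} \odot C$) are omitted in this simplified variant.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Simplified ConvLSTM cell: gates are convolutions over [X_t, H_{t-1}].
    Peephole connections from the equations above are omitted for brevity."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                    # hidden state H, cell state C
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                   # cell update
        h = o * torch.tanh(c)                           # hidden state H_t
        return h, c

# usage: iterate over the 5 uniformly sampled flow frames (each 112 x 176)
# cell = ConvLSTMCell(in_ch=2, hidden_ch=32)
# h = c = torch.zeros(1, 32, 112, 176)
# for x_t in torch.randn(5, 1, 2, 112, 176):
#     h, c = cell(x_t, (h, c))
```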
3.3.2. DSMax Module
To overcome the computational inefficiencies of standard convolutions in deep networks, this study introduced the depthwise-separable max-pooling (DSMax) module, which combines depthwise separable convolution (DSC) and max pooling. The module not only accelerates the model training and inference but also provides optical flow features predicted by ConvLSTM for subsequent feature fusion.
DSC consists of depthwise convolution (DWConv) and pointwise convolution (PWConv). In DWConv, a convolution kernel is applied to perform a convolution operation on each channel of the input feature map. The output of each channel is then spliced to obtain the final output. PWConv actually refers to a 1 × 1 convolution, which serves two roles in the DSC:
Adjusting the number of channels: stand-alone DWConv cannot change the number of output channels; PWConv is used to adjust the number of output channels.
Implementing channel fusion: PWConv is used to perform channel fusion operations on the feature maps output from DWConv so as to effectively integrate feature information.
The DSMax module applies max pooling to the output of the DSC. It reduces the feature map dimensions while retaining important features. The detailed structure of the DSMax module is illustrated in Figure 5.
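A possible realization of the DSMax block is sketched below: depthwise convolution, pointwise (1 × 1) convolution, then max pooling. The channel counts, kernel size, and pooling stride are assumptions, not the published configuration.

```python
import torch.nn as nn

class DSMax(nn.Module):
    """Depthwise-separable convolution (DWConv + PWConv) followed by max pooling."""

    def __init__(self, in_ch, out_ch, kernel_size=3, pool=2):
        super().__init__()
        # DWConv: one kernel per input channel (groups=in_ch)
        self.dwconv = nn.Conv2d(in_ch, in_ch, kernel_size,
                                padding=kernel_size // 2, groups=in_ch, bias=False)
        # PWConv: 1x1 convolution adjusts the channel count and fuses channels
        self.pwconv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(pool)      # halves the spatial dimensions for pool=2

    def forward(self, x):
        return self.pool(self.act(self.bn(self.pwconv(self.dwconv(x)))))
```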
3.3.3. ConvLSTM-4DSMax Model Implementation
The integration of ConvLSTM with four DSMax modules results in the ConvLSTM-4DSMax, which is an effective feature extractor for the road environment beyond the cockpit. ConvLSTM is employed to predict optical flow characteristics exterior to the cockpit, whereas the DSMax modules are responsible for the extraction and refinement of critical features from the data. The ConvLSTM-4DSMax can handle complex spatiotemporal data, enabling the successful extraction of deep feature correlations embedded within temporal sequences.
The structure of ConvLSTM-4DSMax is shown on the right side of Figure 2. The parameter information of each module in ConvLSTM-4DSMax is shown in Table 2. In the output size column, the first number of every line indicates the feature map channel count, and the remaining ones indicate the feature map dimensions.
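Under the same assumptions, the extractor can be wired as a ConvLSTM stage followed by four stacked DSMax blocks, reusing the ConvLSTMCell and DSMax sketches above. The channel widths are illustrative and need not match Table 2.

```python
import torch.nn as nn

class ConvLSTM4DSMax(nn.Module):
    """Illustrative wiring: ConvLSTM over the flow sequence, then four DSMax blocks
    refine the last hidden state into a compact road environment feature vector."""

    def __init__(self, flow_ch=2, hidden_ch=32):
        super().__init__()
        self.cell = ConvLSTMCell(flow_ch, hidden_ch)        # from the sketch above
        self.dsmax = nn.Sequential(                         # four DSMax blocks
            DSMax(hidden_ch, 64), DSMax(64, 128),
            DSMax(128, 256), DSMax(256, 512))
        self.flatten = nn.Flatten()

    def forward(self, flow_seq):                            # (B, T, 2, 112, 176)
        b, t, c, h, w = flow_seq.shape
        hx = cx = flow_seq.new_zeros(b, self.cell.hidden_ch, h, w)
        for step in range(t):
            hx, cx = self.cell(flow_seq[:, step], (hx, cx))
        return self.flatten(self.dsmax(hx))                 # out-features
```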
3.4. D-CAF Module
Driver intention features (in-features) inside the cockpit are obtained by A-3DResNet 50, and road environment features (out-features) outside the cockpit are obtained by ConvLSTM-4DSMax. For these two features, a dynamic combined-feature attention fusion (D-CAF) module was designed in this study, and Figure 6 shows the structure flow of the D-CAF module.
The features inside and outside the cockpit are horizontally spliced according to the column dimensions to obtain the combined features (InOut-features). Subsequently, attention weights (AWs) are calculated using linear layers and a sigmoid activation function. The process is defined as follows:

$$
AW = \sigma\left(WX + b\right)
$$

where $W$ is the weight of the linear layer, $X$ is the input feature vector, $b$ is the bias, and $\sigma$ is the sigmoid function.
Attention weights are obtained by applying a sigmoid activation function to map the values between 0 and 1. The weight matrix W and bias b are learned during the model-training process by continuously adjusting the values of the weight matrix to better model the input data. The combined features undergo an element-wise multiplication operation with the attention weights to obtain attention-weighted features (AW-Features). The D-CAF module achieves dynamic fusion of features both inside and outside the cockpit, enabling the model to adaptively focus on feature information from various parts and merge them.
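The fusion step admits a compact sketch: concatenate the in-features and out-features, compute sigmoid attention weights with a linear layer, and reweight the combined vector element-wise. The class name and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DCAF(nn.Module):
    """Dynamic combined-feature attention fusion: AW = sigmoid(W X + b),
    applied element-wise to the concatenated in/out features."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.attn = nn.Linear(in_dim + out_dim, in_dim + out_dim)

    def forward(self, in_features, out_features):
        combined = torch.cat([in_features, out_features], dim=1)  # InOut-features
        aw = torch.sigmoid(self.attn(combined))                   # attention weights in (0, 1)
        return combined * aw                                      # AW-features
```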
3.5. Loss Function
The loss function is employed to quantify the disparity between the true intention of the driver and the intention recognized by the model. The smaller the difference, the more adept the model is at mapping inputs to outputs. In this study, the cross-entropy loss function was used in the in-cockpit experiments, and the mean squared error loss function was used in the out-of-cockpit experiments and in the joint in-cockpit and out-of-cockpit experiments.
The cross-entropy loss function is commonly used in classification problems. In multi-classification problems, the expression of the cross-entropy loss function is shown below:

$$
L_{CE} = -\sum_{i=1}^{C} y_i \log\left(\hat{y}_i\right)
$$

where $C$ represents the number of categories, $y_i$ represents the true label of class $i$, and $\hat{y}_i$ represents the predicted probability of class $i$.
The mean squared error loss function is used to calculate the mean of the sum of squares of the errors between the predicted data and the original data. The expression of the mean squared error loss function is shown below:

$$
L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2
$$

where $N$ represents the number of samples, $\hat{y}_i$ represents the predicted value, and $y_i$ represents the target value.
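For reference, the two losses map directly onto standard PyTorch criteria. The one-hot handling of class labels for the MSE variant below is an illustrative assumption about how targets would be matched to the model’s softmax output, not the authors’ exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 5
logits = torch.randn(8, num_classes)          # raw model scores (before softmax)
labels = torch.randint(0, num_classes, (8,))  # ground-truth maneuver indices

# Cross-entropy loss (in-cockpit experiments); expects raw logits and class indices.
ce_loss = nn.CrossEntropyLoss()(logits, labels)

# Mean squared error loss (out-of-cockpit and joint experiments); compares the
# softmax probabilities against one-hot targets (an illustrative convention).
probs = F.softmax(logits, dim=1)
one_hot = F.one_hot(labels, num_classes).float()
mse_loss = nn.MSELoss()(probs, one_hot)
```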
3.6. Freeze-Training Strategy
In a multi-network model, each network is designed for a specific task, whereas the final model needs to consider all networks together to achieve the overall goal. In practice, each network performs well on individual tasks. However, when integrated into a unified framework, the results are not satisfactory. The main reason is that the features required for the joint task are overly complex. This complexity makes it challenging for the model to capture all the information during the learning process, ultimately impacting its performance.
To solve this problem, we employed the freezing strategy during training, which involves fixing the parameters of A-3DResNet 50 and ConvLSTM. This enables the model to concentrate its resources on training subsequent layers, leveraging knowledge from previous weight files. Compared with building the model from scratch, this method is more robust and mitigates overfitting risks, making it better suited for real-world applications.
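In PyTorch terms, the freeze-training strategy amounts to disabling gradients for the pretrained branches and optimizing only the remaining layers. The attribute names below mirror the earlier EDNet sketch and are illustrative.

```python
import torch

def freeze_pretrained_branches(model, lr=1e-4):
    """Fix the pretrained A-3DResNet 50 and ConvLSTM weights so that only the
    DSMax blocks, D-CAF module, and classifier are updated during training."""
    for param in model.driver_branch.parameters():      # A-3DResNet 50 branch
        param.requires_grad = False
    for param in model.road_branch.cell.parameters():   # ConvLSTM part only
        param.requires_grad = False
    # The optimizer receives only the parameters that remain trainable.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```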
5. Conclusions
In this paper, we propose a novel end-to-end dual-branch network (EDNet). EDNet uses A-3DResNet 50 as the driver intent feature extractor inside the cockpit, combines ConvLSTM and the depthwise-separable max-pooling (DSMax) module as the road environment feature extractor outside the cockpit, and then integrates the two types of features with the dynamic combined-feature attention fusion (D-CAF) module. During training, a freeze-training strategy is used for A-3DResNet 50 and ConvLSTM. The experiments show that combining information from inside and outside the cockpit improved driver steering intention prediction. Our method achieved impressive results: 85.6% accuracy and an 86.2% F1-score on the Brain4Cars dataset, 86.6% accuracy and an 89.0% F1-score on the Zenodo dataset, and 86.3% accuracy and an 87.8% F1-score on the combination of the two datasets, with a model parameter count of 11.88 M. The results show that the comprehensive performance of our method outperformed other methods in its class.
Despite the significant results achieved by the driver intention recognition method proposed in this paper, it still has some limitations. Research data are relatively scarce, with little publicly available video data recorded both inside and outside the cockpit. Additionally, real driving scenarios frequently involve improper behaviors, such as driver distraction and conversing with others, which can interfere with the recognition of the driver’s steering intention. Therefore, to better adapt the model to real driving scenarios and enhance its application in vehicle driving safety, we plan in future studies to collect data that are more representative of real driving environments, including driving behaviors such as driver distraction, conversation, and eating. We will analyze these data and refine the model to effectively distinguish these improper behaviors. We will also enhance the model’s accuracy in detecting drivers’ lane-changing and turning intentions to meet the needs of real-world applications, and we are committed to making the model more lightweight for practical deployment. With these enhancements, driver behavior will be better standardized, which is expected to reduce traffic accidents caused by driver error, including those resulting from improper driving behavior and misinterpretation of road conditions. Moreover, these improvements will be highly beneficial for autonomous vehicles, significantly enhancing their adaptability and driving safety.