1. Introduction
Everyone has unique driving habits, such as preferred speed, acceleration and braking patterns, which can be regarded as a fingerprint [1]. Thus, drivers' characteristics under driving conditions can be extracted through the analysis of driving behaviors. Considering the different sources of data, we classify most current driving behavior identification models into three classes: visual image or video-based, simulation data-based [2,3,4], and CAN-BUS (Controller Area Network-BUS)/smartphone multi-sensor data-based [5]. Among these, visual data can be viewed as a special case of "multi-sensor data", and the third class, which is more effective and favorable, is the focus of this paper. Specifically, we do not analyze visual data because of the limited amount of training data available.
Generally, multi-sensor data consist of in-vehicle CAN data and smartphone data. The in-vehicle CAN data include the steering wheel angle, vehicle speed, engine speed, brake position, etc., while the smartphone data include speed, orientation, three-axis accelerometer readings, etc. Several works have proposed driver identification methods based on in-vehicle CAN-BUS data [1,6,7,8]. In [9,10], a deep sparse autoencoder (DSAE) was developed to extract hidden features for the visualization of driving behavior, which helped to recognize distinctive driving behavior patterns in continuous data. Some researchers adopted the three-axis accelerometer of an Android-based smartphone to record and analyze various driver behaviors, external road conditions [11], the degree of aggressiveness of each driver [12], and accident detection [13]. From the above works, it can be concluded that driving pattern analysis is an efficient method for driver identification.
With the rapid development of Internet of Vehicles (IoV) technology and the popularization of smart terminal devices such as car on-board diagnostic (OBD) devices, multi-dimensional CAN-BUS data can be easily captured for driving behavior recognition and vehicle owner identification. Driving behavior identification is essentially a classification task based on in-vehicle CAN-BUS data. It is therefore important to choose key features from these driving data and to find effective combinations of features. For instance, driver A likes to accelerate quickly at startup, while driver B is used to driving at a slow speed. Previous works developed complex feature selection techniques to improve the performance of driving behavior identification, yet manual feature combination faces several difficulties. First, the space of feature combinations explodes, making it hard for experts to explore exhaustively, especially when the number of raw features is large. Second, features are difficult to design: part of the available training data has been desensitized for individual privacy protection, so feature engineering cannot simply be performed based on common sense. Third, combined features are difficult to identify and recognize, since most feature interactions are hidden in large amounts of data and can only be captured automatically by machine learning. Fourth, the temporal dynamics of feature activations are difficult to model explicitly. Moreover, how to train the model effectively is also a challenge: CAN-BUS data are often massive and high-dimensional, so a large feature space leads to a growing number of parameters and increases the complexity of model training.
Generally, the activity recognition or identification of drivers relies on combinations of different CAN-BUS sensor data. However, traditional feature extraction methods for driving behavior identification adopt a sliding window to model static and periodic activities [14]. Such features do not scale up well to complex high-level behaviors (e.g., trip-long, day-long or longer), since engineered features correspond not to "units of driving behavior" but to the results of complex sequences of motor movements. Furthermore, in CAN-BUS data, multiple sensors yield multivariate time series; for instance, a single 3-axis accelerometer produces a 3-dimensional time series. Thus, it is desirable to consider the spatial dependency among multiple sensors, or across the axes of accelerometers and gyroscopes, as well as the dependency along the temporal dimension.
In this paper, we propose a deep learning framework that fuses deep convolutional and recurrent neural networks, denoted as attention-based DeepConvGRU and DeepConvLSTM, respectively, for driving behavior identification. The outline of our approach is illustrated in Figure 1. First, in-vehicle CAN-BUS sensor data are collected for each driver to characterize the driver's behavior. Second, the time series CAN-BUS data are normalized and split into fragments with a sliding window. Finally, the annotated data are fed into our proposed deep learning framework for driver behavior identification.
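As a minimal sketch of this pre-processing step, the snippet below assumes per-channel z-score normalization and a fixed-length overlapping window; the window length and step values are only illustrative and mirror the settings discussed in Section 3, while the trace itself is placeholder data.

```python
import numpy as np

def normalize(signals):
    """Z-score normalize each sensor channel of a (time, channels) array."""
    mean = signals.mean(axis=0)
    std = signals.std(axis=0) + 1e-8  # avoid division by zero for constant channels
    return (signals - mean) / std

def sliding_windows(signals, label, window=60, step=10):
    """Split a multivariate time series into fixed-length overlapping fragments.

    signals: (time, channels) array of CAN-BUS readings for one driver.
    label:   integer driver ID assigned to every fragment.
    """
    fragments, labels = [], []
    for start in range(0, len(signals) - window + 1, step):
        fragments.append(signals[start:start + window])
        labels.append(label)
    return np.stack(fragments), np.array(labels)

# Example: a 51-channel CAN-BUS trace for one driver (placeholder values).
trace = np.random.randn(3600, 51)
X, y = sliding_windows(normalize(trace), label=0)
print(X.shape)  # (num_fragments, 60, 51)
```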
The main contributions are summarized as follows:
Our framework can perform automatic activity recognition on real-time multi-dimensional in-vehicle CAN-BUS sensor data, capturing local dependencies along the temporal dimension as well as across spatial locations.
By introducing the attention mechanism, our model can capture salient structures of high-dimensional sensor data and explore the correlations among multi-channel sensor data for rich feature representations, improving the learning performance of the model.
Our framework can perform end-to-end training without any feature selection and work directly on the raw sensor data with simple pre-processing, making it universally applicable.
3. Results
Our model is evaluated and compared with two variants of our model created by removing the attention units, as well as with other state-of-the-art models [27], which are described below:
DeepConvGRU–Attention: This model has two depthwise separable convolutional layers and a pooling layer in the CNN module, followed by two stacked GRU layers with attention (a minimal architecture sketch is given after this list).
DeepConvLSTM–Attention: Compared to DeepConvGRU–Attention, this model replaces GRU with LSTM in the recurrent layers.
DeepConvGRU: This model is identical to DeepConvGRU–Attention but without the attention units.
DeepConvLSTM: Similarly, this model removes attention units from DeepConvLSTM–Attention.
CNN: This model has two depthwise separable convolutional layers and a pooling layer, with a softmax classifier at the output. This baseline is used to verify the effectiveness of the recurrent layers in finding and exploiting long-range dependencies in time series data, which is suitable for driving pattern recognition.
LSTM: This model has two stacked LSTM layers, as described in [37,38].
DNN: This model has two stacked hidden layers, as described in [38].
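For concreteness, the following is a minimal Keras sketch of the DeepConvGRU–Attention variant. The filter counts, kernel sizes, hidden sizes, number of driver classes and the exact form of the attention layer are illustrative assumptions, not the tuned hyper-parameters reported in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deepconvgru_attention(window=60, channels=51, num_drivers=10):
    """Sketch of the DeepConvGRU-Attention variant described above."""
    inputs = layers.Input(shape=(window, channels))

    # Two depthwise separable convolutional layers followed by pooling.
    x = layers.SeparableConv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.SeparableConv1D(64, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)

    # Two stacked GRU layers returning the full sequence of hidden states.
    x = layers.GRU(128, return_sequences=True)(x)
    x = layers.GRU(128, return_sequences=True)(x)

    # Simple additive attention: score each time step, then take the weighted sum.
    scores = layers.Dense(1, activation="tanh")(x)            # (batch, time, 1)
    weights = layers.Softmax(axis=1)(scores)                  # attention weights over time
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

    outputs = layers.Dense(num_drivers, activation="softmax")(context)
    return models.Model(inputs, outputs)

model = build_deepconvgru_attention()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```

The LSTM variant is obtained by swapping the GRU layers for LSTM layers, and the non-attention variants by replacing the attention block with the last hidden state of the recurrent stack.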
Our model used all of the original 51-dimensional (51D) features to identify driving behaviors. To show the power of our end-to-end framework, the feature selection of [27] was also implemented: 15-dimensional (15D) features were selected from the original 51D features, and three statistical features were derived for each of them, yielding 45-dimensional (45D) statistical features in total. Table 1 shows the selected original features and the statistical features. We chose the KNN, Decision Tree and Random Forest algorithms as baselines [27,38], as they have been proven to yield good performance.
The Ocslab driving dataset was split into a training set and a test set with a ratio of 7:3 to validate model performance. To achieve the best performance for each model on this dataset, the parameters of the models were fully tuned. The hyper-parameters of the compared deep models are listed in Table 2, which shows the layer structure, window size, dropout, activation function and optimizer.
The window size and sliding step were set following [39]. As shown in Table 3, different features were chosen for different models to obtain the best performance.
In the training stage, the effects of the attention and RNN units were investigated in terms of model learning efficiency and generalization ability, using the Adam optimizer. Five-fold cross-validation was used to ensure that the proposed model generalized over the dataset: the data samples were divided into five parts, of which four were used for training and the remaining one for validation.
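A minimal sketch of this evaluation protocol is given below; the placeholder data, the random seeds and the commented-out fitting call are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# X: (num_fragments, window, channels) fragments, y: driver labels (placeholders).
X = np.random.randn(1000, 60, 51)
y = np.random.randint(0, 10, size=1000)

# 7:3 split of the fragments into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 5-fold cross-validation on the training portion: four folds train, one validates.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(kfold.split(X_train)):
    X_tr, X_va = X_train[tr_idx], X_train[va_idx]
    y_tr, y_va = y_train[tr_idx], y_train[va_idx]
    # model.fit(X_tr, y_tr, validation_data=(X_va, y_va), ...)  # train one of the compared models
    print(f"fold {fold}: {len(tr_idx)} training / {len(va_idx)} validation fragments")
```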
Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12 illustrate the evaluations of the first-fold training and validation stages with respect to accuracy.
The legends of Figure 7, Figure 8 and Figure 9 are identical. From Figure 7a, LSTM and DNN did not converge when using all of the original features without any feature selection. The other models could select features automatically thanks to their convolutional layers. From Figure 7b, DeepConvGRU and DeepConvLSTM gained better generalization ability than CNN by capturing local dependencies along the temporal dimension. DeepConvGRU learned faster than DeepConvLSTM because GRU has fewer parameters and therefore converged more easily. From Figure 8 and Figure 9, it can be seen that the attention-based DeepConvGRU and DeepConvLSTM again performed best among the compared models. The attention mechanism made the models converge more easily.
From the results in Figure 7, Figure 8 and Figure 9, we can see that the attention-based DeepConvGRU/DeepConvLSTM consistently outperformed the baselines. It can also be noticed that DeepConvGRU achieved a striking performance improvement. This may be because LSTM has more parameters than GRU, which makes it more difficult to converge on a small dataset. The fact that DeepConvGRU/DeepConvLSTM obtained better performance than CNN may be due to the ability of RNN cells to capture temporal dynamics within the data sequences, whereas the baseline CNN can only model time sequences up to the length of its kernels. Moreover, since LSTM and DNN did not converge when using all of the original features, LSTM and DNN with the selected 15D features and the statistical 45D features were investigated and compared with the other models in Figure 10, Figure 11 and Figure 12.
The legends of Figure 10, Figure 11 and Figure 12 are identical. From Figure 10, Figure 11 and Figure 12, it can be seen that the attention-based DeepConvGRU and DeepConvLSTM, using the original 51D features without any feature selection, achieved performance comparable to LSTM and DNN using artificially designed features. The baseline DNN using the statistical 45D features yielded poor learning efficiency and generalization ability when the window size was set to 60 samples and the sliding step to 10 samples. Furthermore, the baseline DNN using the selected 15D features failed to converge in all cases.
To fully compare the models, F1 scores were computed for all models except those that failed to converge. The results are shown in Table 4. Table 4 and Table 5 present the performance of the four proposed variants of our framework and of traditional models, including CNN, LSTM, KNN, Decision Tree and Random Forest, under different window sizes and sliding steps. The experimental results show that our framework outperformed the traditional methods without any feature selection. Without feature selection, our framework also performed better than DNN and achieved performance similar to LSTM using artificially designed features. Moreover, DeepConvGRU–Attention and DeepConvLSTM–Attention yielded further improvements over DeepConvGRU and DeepConvLSTM, respectively. In conclusion, the attention mechanism effectively helps to learn more discriminative features from time series data.
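For reference, the F1 scores can be computed as sketched below; the macro averaging is an assumption made for illustration, since the averaging variant reported in Table 4 and Table 5 is not restated here, and the labels are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder labels; in practice y_pred is the argmax over the softmax outputs
# of a trained model on the held-out test fragments.
y_true = np.random.randint(0, 10, size=300)
y_pred = np.random.randint(0, 10, size=300)

# Macro-averaged F1 treats every driver class equally.
print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
```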
4. Discussion
From the performance comparison of our attention-based DeepConvGRU/DeepConvLSTM with the baseline models that lack RNN units and attention units before the dense layer, several main findings were obtained.
First, DeepConvGRU/DeepConvLSTM reaches a higher F1 score. It is significantly better suited to disambiguating closely related activities, which tend to differ only in the ordering of the time series data, and it is applicable to activities that are longer than the observation window. The experimental results show that our framework can capture local dependencies along the temporal dimension as well as across spatial locations.
Second, the attention mechanism gives DeepConvGRU/DeepConvLSTM better generalization ability, as it automatically learns feature weights and extracts the features that are important for driving behavior identification.
Third, our framework outperforms traditional methods without any feature selection. Since CAN-BUS data are often massive and high-dimensional, our framework is particularly advantageous when feature selection is difficult.
Furthermore, although the duration of a driving activity is longer than the sliding window, the experimental results showed that the model nevertheless obtains good performance. This might be because long driving activities are composed of several short characteristic patterns, allowing the model to spot and classify a driving activity even without a complete view of it.
5. Conclusions
This paper presented a deep learning framework based on the combination of a CNN and a GRU/LSTM recurrent network to identify driving behaviors from in-vehicle CAN-BUS sensor data. In the framework, GRU/LSTM cells were integrated with the CNN to distinguish similar driving behaviors, and the attention-based DeepConvGRU/DeepConvLSTM took further advantage of the learned temporal dynamics. Experimental results showed that our proposed method outperformed traditional methods on the Ocslab driving dataset.
The experimental results also made it clear that the proposed framework is able to learn features from the original signals and fuse the learned features without any specific preprocessing. Surprisingly but reasonably, the attention-based DeepConvGRU achieved competitive F1 scores (0.984 and 0.970, respectively) while directly using the 51-channel original sensor data. This provides a path toward automatically processing sensor data from different sources.
In the future, further researches can be conducted in the following aspects:
First, a multi-scale approach should be developed to achieve accurate activity recognition on in-vehicle CAN-BUS sensor data.
Second, due to the individual privacy protection of some driving datasets, most datasets do not disclose the complete time series of driving behaviors from different drivers. Therefore, our framework could only be verified on a public driving behavior dataset. In the future, we need to investigate our model on more practical, large-scale Naturalistic Driving Study (NDS) datasets, such as 100-Car [40] and SHRP2 NDS [41,42].