Driver Behavior Analysis via Two-Stream Deep Convolutional Neural Network

: According to the World Health Organization global status report on road safety, tra ﬃ c accidents are the eighth leading cause of death in the world, and nearly one-ﬁfth of the tra ﬃ c accidents were cause by driver distractions. Inspired by the famous two-stream convolutional neural network (CNN) model, we propose a driver behavior analysis system using one spatial stream ConvNet to extract the spatial features and one temporal stream ConvNet to capture the driver’s motion information. Instead of using three-dimensional (3D) ConvNet, which would su ﬀ er from large parameters and the lack of a pre-trained model, two-dimensional (2D) ConvNet is used to construct the spatial and temporal ConvNet streams, and they were pre-trained by the large-scale ImageNet. In addition, in order to integrate di ﬀ erent modalities, the feature-level fusion methodology was applied, and a fusion network was designed to integrate the spatial and temporal features for further classiﬁcation. Moreover, a self-compiled dataset of 10 actions in the vehicle was established. According to the experimental results, the proposed system can increase the accuracy rate by nearly 30% compared to the two-stream CNN model with a score-level fusion.


Introduction
According to the World Health Organization (WHO) global status report on road safety, in 2018, traffic accidents were the eighth leading cause of death in the world [1]. Nearly 135 million people were killed and injured every year. According to the report, nearly one-fifth of the traffic accidents were cause by driver distractions [2]. Although the mortality rate of traffic accidents in the world declined in recent years, compared with other countries, the mortality rate is much higher in Taiwan. In addition, due to the development of science and technology in recent years, in-vehicle information systems (IVISs) such as navigation and media devices were installed in cars. These devices would introduce more driver distractions and lead to more accidents [2]. Road traffic accidents cause huge damage, and the number of accidents due to distractions is increasing. According to the National Highway Traffic Safety Administrator of the United States (NHTSA) reports [2], human errors caused approximately 90% road accidents in the United States and represented a dominant factor for vehicle crashes [3]. Among them, the major cause of these accidents was the use of mobile phones [4]. As a result, the government enacted several laws to punish drivers and prevent distractions caused by the use of hi-tech products while driving.
In fact, using a smartphone is not the only reason for driver distractions. NHTSA defines a distraction during driving as "any activity that diverts attention of the driver from the task of driving", including talking or texting on one's phone, eating and drinking, or talking to people in the vehicle [4]. The Centers for Disease Control and Prevention (CDC) [5] classified distracted driving as cognitive, Hence, inspired by the famous network architecture, using a two-stream CNN model [27] composed of a spatial and a temporal stream ConvNet, the spatial and temporal information was extracted using powerful CNN models for the classification of common 10 distracted behaviors inside the vehicle. For the spatial information, an average pooling in the temporal domain was performed on 10 consecutive RGB images, and then the resulting map was input to the following convolutional layers. The feature map from the last convolutional layer was taken as the spatial features for further processing. On the other hand, TVL 1 optical flow [28] was firstly applied to extract the horizontal and vertical motion information between input frames, and then two stacks of flow images were concatenated and input to the temporal stream ConvNet. The feature map from the last convolutional layer was extracted as the temporal features. Note that, instead of using a three-dimensional (3D) ConvNet, which would suffer from large parameters and the lack of a pre-trained model, two-dimensional (2D) ConvNet was used to construct the spatial and temporal ConvNet streams, and they were pre-trained using the large-scale ImageNet [12]. Moreover, in order to integrate different modalities, rather than using manually defined weights in the score-level fusion [27] or weights obtained via the optimization process [24], the feature-level methodology was applied and a fusion network was proposed to analyze the concatenated spatial and temporal features for classification.
The remainder of this paper is organized as follows: Section 2 reviews the issues of driver distraction; Section 3 presents the proposed system consisting of three modules, spatial deep network, temporal deep network, and integration module; Section 4 presents the experiment results of the proposed system and compares it with other systems; Section 5 concludes the paper.

Related Works
In the computer vision field, action recognition was developed for many years. Among these researches, how to extract action features is a key issue. In early years, studies focused on the hand-crafted local features that were designed by human experts to extract a given set of chosen characteristics [29]. Since temporal information includes important cues for action recognition, famous feature descriptors for image classification such as scale-invariant feature transform (SIFT) and histogram of oriented gradient (HOG) were extended to extract features for 3D data (2D spatial + one-dimensional (1D) temporal data). In 2003, Laptev et al. [30] proposed spatio-temporal interest points (STIPs) by extending Harris corner detectors. SIFT and HOG were also extended to SIFT-3D and HOG3D for action recognition. Dollar et al. [31] proposed the cuboid feature for behavior identification. Sadanand and Corso [32] established the ActionBank for action recognition. In 2013, Wang et al. [33] proposed improved dense trajectories (iDT). Although action recognition was studied for at least 20 years, the performance is still limited due to the difficulties including large appearance and pose variations, as well as a cluttered background [34]. With the development of convolutional neural networks (CNNs), CNN showed the advantages of performance improvement for action recognition compared to hand-crafted features [29]. It also makes real-life applications of action recognition possible. Among them, driver behavior analysis is a specific issue and becomes very important because in-vehicle information systems (IVISs) such as navigation and media devices are being developed rapidly but they introduce more distractions which cause accidents.
In early studies of driver distraction detection, motivated by the fact that the major cause of distractions is using a cellphone while driving [6], the detection of cellphone usage became a focus and many machine learning models were applied. For example, in 2011, Zhang et al. [7] extracted the features from the image captured by a camera installed on the dashboard and a hidden condition random field model was applied. Berri et al. [8] applied an SVM model to check the hands and face locations with the frontal image view of the driver. In order to cope with occlusion, Craye et al. [9] used the RGB-D data captured by Kinect sensors, but the distance between the sensor and the driver was missed in the data collection process. Unlike previous methods that assumed safe driving based on the location or state of the face or hands, Seshardi et al. [10] applied a supervised descent method to track the locations of face landmarks, and an AdaBoost classifier trained for each side of the face regions was used to classify the cellphone usage as right hand, left hand, or no usage.
According to the definition of a distraction by the NHTSA as "any activity that diverts attention of the driver from the task of driving" [4], the CDC [5] classified distracted driving as cognitive, visual, and manual distractions, whereby not only cellphone usage detection was developed but more distracted behaviors were also considered. In 2014, Martin et al. [35] collaborated with the University of California, San Diego (UCSD) Laboratory of Intelligent and Safe Automobiles and presented a vision-based hand activity analysis system to detect three types of distractions: adjusting video, adjusting mirrors, and operating gear. The images are captured by two Kinect cameras that can provide the frontal and back views of the driver. Ohn-bar et al. [36] divided images into three regions: steering wheel, gear, and radio panel. A region-based model was designed to detect the presence of hands in certain pre-defined regions to classify three types of distractions. A more inclusive distracted driving dataset including four distractions (safe driving (holding steering wheel), operating shift, eating, and talking on the cellphone) was considered in Reference [37]. Zhao et al. [37] used contourlet transform for feature extraction, and different classifiers were applied including random forests, k-nearest neighbor, and multiplayer perceptron. Note that, in earlier years the datasets defined a limited number of distracted behaviors for distraction detection, and most datasets were not public. In 2016, StateFarm's distracted driver detection competition on Kaggle [38] was the first publicly available dataset for competition purposes. They defined 10 distractions to be detected: safe driving, texting using right hand, talking on the phone using right hand, texting using left hand, talking on the phone using left hand, operating the radio, drinking, reaching behind, doing hair and makeup, and talking to passenger. In 2017, Abouelnaga et al. [24] created a new AUC Distracted Driver dataset similar to StateFarm's dataset. The dataset was composed of the same 10 distraction behaviors and 31 participants from seven different countries in four different cars. They also proposed a real-time distracted driver posture classification system. Driver behavior recognition is a kind of action recognition where the performance is rapidly increasing because of the rise of deep learning models [12][13][14], where some network architectures were developed for action recognition [39]. By extending the network architecture for image classification and the spirit of the bag-of-words model to extract features and identify each video frame [39,40], static CNNs were used to identify single or several video frames, and then the final recognition result was obtained by averaging scores across the whole video [40]. However, the motion information in the temporal domain between objects is ignored and the lack of specific temporal structure leads to only a slight accuracy improvement compared to methods based on hand-crafted features. Hence, some researchers improved the network architecture [39,[41][42][43][44][45][46]. The studies [41,42] added a recurrent layer, such as an long short-term memory (LSTM), to the 2D ConvNets (2D ConvNets + LSTM) and, thus, the temporal ordering could be captured and encoded in the states. Although LSTMs on the features from the last layers of 2D ConvNets could model high-level variations, fine low-level motion could not be captured, and the detailed descriptor of the network was lost because of the backpropagation [39]. The training time was also increased because it required unrolling the network through multiple frames for backpropagation through time [27,39]. On the other hand, 3D ConvNets [43,44] seem to be a natural method to describe videos, and it can extract spatio-temporal features. However, there are two problems with this model [27]. Firstly, it is harder to train than 2D ConvNets because more parameters are needed. Secondly, due to the large network architecture and 3D filter usage, it cannot directly use the pre-trained model on large-scale datasets, such as ImageNet. As a result, it is necessary to train from scratch, and an overfitting situation would happen in the case of an increased number of parameters. In 2017, Simonyan and Zisserman [39] proposed a practical approach by modeling spatial and short temporal snapshots via two-stream networks. A single RGB image and a stack of 10 continuous optical-flow frames were input to the 2D ConvNets which were pre-trained on the ImageNet dataset, while, in the test process, multiple snapshots were sampled from the video and the action prediction was averaged. According to the experimental results, this network architecture achieved high performance on the existing benchmarks, while the network was very efficient to train and test [39]. In addition, other models such as bidirectional LSTM [45] or neural hypergraph-based model [46] for sequence prediction in natural language processing could be applied as well.
For distraction behavior recognition, some deep-learning-based methods were proposed in recent years. In 2016, Le et al. [15] trained a faster RCNN model to classify whether the hands are holding the steering wheel or not, and the results showed that the work could achieve a higher accuracy rate than found in Reference [10]. In Reference [18], Streiffer et al. proposed DarNet and investigated the mixing of different models, i.e., CNNs, RNNs, and SVMs, to detect driver distraction. In 2018, Majdi et al. [19] proposed Drive-Net, which was composed of a convolutional neural network (CNN) and a random decision forest, for the classification of 10 distracted behaviors. Tran et al. [20] utilized four different CNN models including VGG-16, AlexNet, GoogleNet, and residual network to classify 10 distracted behaviors. Yen et al. [47] applied a CNN model to extract features and recognize four postures, including normal driving, using a cell phone call, eating, and smoking on the Southeast University (SEU) driving posture dataset [37]. Note that the authors also conducted the system on self-compiled infrared images, which could be invariant to illumination changes seen in daytime/night conditions for real driving environments. The results demonstrated better performance than conventional classifiers, e.g., SVM, with hand-crafted features. In 2017, Abouelnaga et al. [18,24] proposed a real-time distracted driver posture classification system. The input image was firstly pre-processed by the detection and segmentation methods that were applied to the regions of face, hands, and skin, and then a weighted ensemble of four different convolutional neural networks was applied for classification. In 2018, Masood et al. [26] applied VGG-16 and VGG-19 to recognize 10 distraction actions on the StateFarm dataset and showed that the impressive results achieved by the CNN models and the usage of the pre-trained model indeed saved training time. Baheti et al. [25] proposed a CNN-based system by modifying the VGG-16 network, including using the leaky rectified linear unit (Leaky ReLU) activation function instead of ReLU and applying various regularization techniques to cope with the problem of overfitting. Moreover, the authors proposed the network of modified-VGG16 by replacing the whole fully connected layer with a convolutional layer in order to reduce the number of network parameters. The results showed that the system could achieve 96.31% accuracy on the AUC Distracted Driver dataset. However, in recent works [19,20,[24][25][26], only spatial information was considered, but temporal information, related to important cues for behavior recognition, were discarded. In Reference [48], Chuang et al. proposed a skeleton-based and a point-cloud approach with multiple views based on Kinect depth cameras for driver behavior recognition, and LSTM was adopted to train the behavior model. The authors evaluated the proposed system on the VAP multiple views dataset which was collected in the laboratory to simulate an in-vehicle scene. However, the distance between the sensor and the driver was missed, and there was a difficulty in the training process inherited from the LSTM model [27,39]. Additionally, there were related issues such as driver intention prediction to anticipate driver maneuvers. In Reference [49], Gebert et al. proposed an end-to-end network architecture which consisted of FlowNet [26] to extract optical flow, a 3D residual network for maneuver classification, and an LSTM model for handling temporal data with varying length. Note that FlowNet was used to extract the optical flow in the video interpolation as well; however, labeling the ground-truth flow data to train FlowNet for a specific task is hard work and time-consuming.
Moreover, in order to cope with illumination changes encountered in realistic driving scenarios, there are other methods of analyzing data captured by illumination-invariant sensors such as depth sensors [9,48,50], infrared cameras [47], or fusion of various sensor types [9,51]. In 2019, Martin et al. [52] presented the first large-scale dataset, Drive & Act dataset, for fine-grained categorization of driver behavior. The image frames were captured by six different views and three modalities that were collected by five near-infrared cameras and Kinect v2 cameras used to acquire color, infrared, and depth data. The authors aimed to facilitate researches for video-and pose-based action recognition.

Method
For action recognition, temporal information is important, as well as spatial information. Motivated by work [27] using temporal information, we proposed an architecture of two-stream convolutional networks for distracted detection as shown in Figure 1. The network is composed of three sub-networks: spatial stream ConvNet, temporal stream ConvNet, and a fusion network. Spatial stream ConvNet and temporal stream ConvNet are used to extract the spatial and temporal features, respectively, and the different features are integrated in the fusion network. By benefitting from transfer learning, the spatial stream ConvNet was designed based on the famous network configuration, VGG-16, and the pre-trained model on the ImageNet dataset can be applied. The average pooling in the temporal dimension is firstly performed on 10 consecutive RGB images, and the result is passed to the following layers of the spatial stream ConvNet. Via convolution operators performed in the convolutional layers, the spatial features are obtained from the feature map in the last convolutional layer. On the other hand, instead of using the famous 3D ConvNet that would suffer from large parameters and overfitting when there is lack of a large training dataset [27], the temporal features are obtained by extracting the motion information via the TVL 1 optical flow [28] from video frames and then a stack of consecutive flow images capturing vertical and horizontal motion information are input to the temporal stream ConvNet. Then, the temporal features are obtained from the feature map in the last convolution layer of temporal stream ConvNet. In the end, a fusion network, consisting of two convolutional layers and two fully connected layers, was designed to integrate the spatial and temporal features to classify 10 distracted behaviors.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 14 features, respectively, and the different features are integrated in the fusion network. By benefitting from transfer learning, the spatial stream ConvNet was designed based on the famous network configuration, VGG-16, and the pre-trained model on the ImageNet dataset can be applied. The average pooling in the temporal dimension is firstly performed on 10 consecutive RGB images, and the result is passed to the following layers of the spatial stream ConvNet. Via convolution operators performed in the convolutional layers, the spatial features are obtained from the feature map in the last convolutional layer. On the other hand, instead of using the famous 3D ConvNet that would suffer from large parameters and overfitting when there is lack of a large training dataset [27], the temporal features are obtained by extracting the motion information via the TVL 1 optical flow [28] from video frames and then a stack of consecutive flow images capturing vertical and horizontal motion information are input to the temporal stream ConvNet. Then, the temporal features are obtained from the feature map in the last convolution layer of temporal stream ConvNet. In the end, a fusion network, consisting of two convolutional layers and two fully connected layers, was designed to integrate the spatial and temporal features to classify 10 distracted behaviors.

Spatial Stream ConvNet
The convolutional neural network (CNN) plays an important role in deep learning models. In recent years, many studies relied on famous network architectures such as AlexNet [12], VGG-16 [21], and GoogleNet [22,23]. These networks usually consist of convolutional layers and pooling layers, followed by few fully connected layers. Some studies showed that better accuracy can be achieved using a deeper network. By considering the balance of performance and computation efficiency and the promising results [26], the spatial stream ConvNet was based on the network configuration of VGG-16 [21], which was designed for image classification and was proven to effectively extract the features of images layer by layer. Figure 2 shows the network configuration of spatial stream ConvNet. The network input is a stack of 10 consecutive RGB frames, each of which were resized to 224 × 224 pixels. By benefitting from the transfer learning of using the pre-trained model on the largescale dataset, averaging pooling of the temporal direction is firstly performed on these input frames, and the resulting map with the size of 224 × 224 × 3 pixels is obtained. Following the average pooling layer, the network is composed of 13 convolution layers, while five max pooling and the ReLU activation function are set in each layer. Unlike other CNN models which set different kernel size in convolutional layers, a small kernel size of 3 × 3 pixels is set in all convolution layers which keeps the scale-invariant feature transform after convolution by using the same padding mechanism. The max pooling can help to extract the feature information of a larger area. Although the configuration of the VGG-16 is simple and effective, the large number of parameters (140 million) on fully connected

Spatial Stream ConvNet
The convolutional neural network (CNN) plays an important role in deep learning models. In recent years, many studies relied on famous network architectures such as AlexNet [12], VGG-16 [21], and GoogleNet [22,23]. These networks usually consist of convolutional layers and pooling layers, followed by few fully connected layers. Some studies showed that better accuracy can be achieved using a deeper network. By considering the balance of performance and computation efficiency and the promising results [26], the spatial stream ConvNet was based on the network configuration of VGG-16 [21], which was designed for image classification and was proven to effectively extract the features of images layer by layer. Figure 2 shows the network configuration of spatial stream ConvNet. The network input is a stack of 10 consecutive RGB frames, each of which were resized to 224 × 224 pixels. By benefitting from the transfer learning of using the pre-trained model on the large-scale dataset, averaging pooling of the temporal direction is firstly performed on these input frames, and the resulting map with the size of 224 × 224 × 3 pixels is obtained. Following the average pooling layer, the network is composed of 13 convolution layers, while five max pooling and the ReLU activation function are set in each layer. Unlike other CNN models which set different kernel size in convolutional layers, a small kernel size of 3 × 3 pixels is set in all convolution layers which keeps the scale-invariant feature transform after convolution by using the same padding mechanism. The max pooling can help to extract the feature information of a larger area. Although the configuration of the VGG-16 is simple and effective, the large number of parameters (140 million) on fully connected layers is the main problem [21]. This leads to high cost for the training and test process. Hence, in our work, unlike the original VGG-16 configuration, the fully connected layer was not used in the proposed spatial stream ConvNet. Here, a feature map with the size of 7 × 7 × 256 pixels from the last convolution layer is extracted as the spatial features and further processed in the following fusion network.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 7 of 14 convolution layer is extracted as the spatial features and further processed in the following fusion network. Since the parameters of CNN models are large, transfer learning, using the pre-training model to initialize the network, is applied in the training process. For driver distracted behavior analysis, the hand movements are the main changes. Hence, rather than using the pre-trained model on the ImageNet, the spatial stream ConvNet was pre-trained on the dataset that contains the videos of N (N = 10 in our study) actions containing obvious hand movements from thUCF-101 dataset [53]. Additionally, data augmentation, including cropping, rotating, horizontal flipping, and shifting, was performed to enlarge the size of dataset. Note that, in order to obtain the pre-trained model with the ability to extract discriminative features, a softmax layer with the size of 1 × 1 × N was added after the last convolutional layer, which was removed in the fine-tuned process. The network was trained by stochastic gradient descent with a learning rate of 0.0001, decay rate of 10 −6 , and momentum value 0.9. The batch size and number of epochs were set to 32 and 100, respectively.

Temporal Stream ConvNet
Although distracted behavior analysis was studied in recent works [19,[24][25][26], only spatial information was considered while temporal information was discarded. In previous studies of action recognition, many network configurations were designed to integrate the spatial and temporal information, for example, 2D ConvNets + LSTM [41,42], 3D ConvNet [43,44], and two-stream network [27]. Instead of using the famous 3D ConvNet [43,44] that would suffer from large parameters and overfitting when there is lack of a large training dataset [27], the two-stream network was used to design the temporal stream ConvNet. In order to extract the motion information, TVL 1 optical flow [28] with default parameters in OpenCV is firstly applied to obtain vertical and horizontal flow frames between two consecutive frames.
Then, the optical frames of two directions are concatenated, and a stack of 20 flow images with the size of 224 × 224 × 20 pixels are input to the temporal stream ConvNet. Figure 3 shows the configuration of the temporal stream ConvNet. Following the input layer, the network is composed of 13 convolution layers with a kernel size of 3 × 3 pixels, while five max pooling and the ReLU activation function are set in each layer. Note that, in our work, unlike the original VGG-16 configuration, the fully connected layers were not used in the proposed temporal stream ConvNet. Here, a feature map with the size of 7 × 7 × 256 pixels from the last convolution layer is extracted as the temporal features and fused with the spatial features in the following fusion network. Since the parameters of CNN models are large, transfer learning, using the pre-training model to initialize the network, is applied in the training process. For driver distracted behavior analysis, the hand movements are the main changes. Hence, rather than using the pre-trained model on the ImageNet, the spatial stream ConvNet was pre-trained on the dataset that contains the videos of N (N = 10 in our study) actions containing obvious hand movements from thUCF-101 dataset [53]. Additionally, data augmentation, including cropping, rotating, horizontal flipping, and shifting, was performed to enlarge the size of dataset. Note that, in order to obtain the pre-trained model with the ability to extract discriminative features, a softmax layer with the size of 1 × 1 × N was added after the last convolutional layer, which was removed in the fine-tuned process. The network was trained by stochastic gradient descent with a learning rate of 0.0001, decay rate of 10 −6 , and momentum value 0.9. The batch size and number of epochs were set to 32 and 100, respectively.

Temporal Stream ConvNet
Although distracted behavior analysis was studied in recent works [19,[24][25][26], only spatial information was considered while temporal information was discarded. In previous studies of action recognition, many network configurations were designed to integrate the spatial and temporal information, for example, 2D ConvNets + LSTM [41,42], 3D ConvNet [43,44], and two-stream network [27]. Instead of using the famous 3D ConvNet [43,44] that would suffer from large parameters and overfitting when there is lack of a large training dataset [27], the two-stream network was used to design the temporal stream ConvNet. In order to extract the motion information, TVL 1 optical flow [28] with default parameters in OpenCV is firstly applied to obtain vertical and horizontal flow frames between two consecutive frames.
Then, the optical frames of two directions are concatenated, and a stack of 20 flow images with the size of 224 × 224 × 20 pixels are input to the temporal stream ConvNet. Figure 3 shows the configuration of the temporal stream ConvNet. Following the input layer, the network is composed of 13 convolution layers with a kernel size of 3 × 3 pixels, while five max pooling and the ReLU activation function are set in each layer. Note that, in our work, unlike the original VGG-16 configuration, the Appl. Sci. 2020, 10, 1908 8 of 14 fully connected layers were not used in the proposed temporal stream ConvNet. Here, a feature map with the size of 7 × 7 × 256 pixels from the last convolution layer is extracted as the temporal features and fused with the spatial features in the following fusion network. In order to train the temporal stream ConvNet, the pre-trained dataset, consisting of videos of N (N = 10) actions involving hand motions from the UCF-101 dataset, as used in the training process of the spatial stream ConvNet, was applied. Data augmentation involving random cropping and horizontal flipping was also performed to enlarge the dataset. The vertical and horizontal motion information was estimated between two frames via TVL 1 optical flow. Additionally, in order to reduce the impact caused by the camera movement, the mean flow frame of each direction is calculated, and the flow frames were subtracted from the corresponding mean frame. For the pretraining process, a softmax layer with the size of 1 × 1 × N was added after the last convolutional layer, which was removed in the fine-tuned process, and the hyper-parameters were the same as used in the spatial stream ConvNet. The pre-trained model was then obtained, which was used to initialize the temporal stream ConvNet in the training process of the whole network.

Classification of Distractions by a Fusion Network
After the spatial and temporal stream ConvNet, the spatial and temporal feature maps with a size of 7 × 7 × 256 pixels are obtained. In other to fuse different modalities, score-level and featurelevel fusion are common methodologies [54]. Since the dimension of features is high, i.e., 25,088, rather than using either the manual-defined weights or weights obtained via the optimization method, e.g., genetic algorithm (GA) [24], at the score level, a fusion network was designed to fuse features in the feature level. Figure 4 shows the configuration of the fusion network. Two kinds of feature maps are concatenated in the third dimension, and then the resulting feature map with a size of 7 × 7 × 512 pixels is input to the following convolutional layer. Since the CNN model as used in the spatial and temporal stream ConvNet has promising ability of feature extraction, the fusion network was not designed with deep layers in order to reduce the risk of overfitting. In our study, the fusion network consisted of two convolutional layers, as well as one pooling layer and two fully connected layers with sizes of 1 × 1 × 512 and 1 × 1 × 10, respectively. The kernel size of the first and second convolutional layers was 1 × 1 and 3 × 3 pixels, and the ReLU activation function was applied to both convolutional layers. The classification result of 10 distracted behaviors is obtained in the softmax layer. In order to train the temporal stream ConvNet, the pre-trained dataset, consisting of videos of N (N = 10) actions involving hand motions from the UCF-101 dataset, as used in the training process of the spatial stream ConvNet, was applied. Data augmentation involving random cropping and horizontal flipping was also performed to enlarge the dataset. The vertical and horizontal motion information was estimated between two frames via TVL 1 optical flow. Additionally, in order to reduce the impact caused by the camera movement, the mean flow frame of each direction is calculated, and the flow frames were subtracted from the corresponding mean frame. For the pre-training process, a softmax layer with the size of 1 × 1 × N was added after the last convolutional layer, which was removed in the fine-tuned process, and the hyper-parameters were the same as used in the spatial stream ConvNet. The pre-trained model was then obtained, which was used to initialize the temporal stream ConvNet in the training process of the whole network.

Classification of Distractions by a Fusion Network
After the spatial and temporal stream ConvNet, the spatial and temporal feature maps with a size of 7 × 7 × 256 pixels are obtained. In other to fuse different modalities, score-level and feature-level fusion are common methodologies [54]. Since the dimension of features is high, i.e., 25,088, rather than using either the manual-defined weights or weights obtained via the optimization method, e.g., genetic algorithm (GA) [24], at the score level, a fusion network was designed to fuse features in the feature level. Figure 4 shows the configuration of the fusion network. Two kinds of feature maps are concatenated in the third dimension, and then the resulting feature map with a size of 7 × 7 × 512 pixels is input to the following convolutional layer. Since the CNN model as used in the spatial and temporal stream ConvNet has promising ability of feature extraction, the fusion network was not designed with deep layers in order to reduce the risk of overfitting. In our study, the fusion network consisted of two convolutional layers, as well as one pooling layer and two fully connected layers with sizes of 1 × 1 × 512 and 1 × 1 × 10, respectively. The kernel size of the first and second convolutional layers was 1 × 1 and 3 × 3 pixels, and the ReLU activation function was applied to both convolutional layers. The classification result of 10 distracted behaviors is obtained in the softmax layer. spatial and temporal stream ConvNet has promising ability of feature extraction, the fusion network was not designed with deep layers in order to reduce the risk of overfitting. In our study, the fusion network consisted of two convolutional layers, as well as one pooling layer and two fully connected layers with sizes of 1 × 1 × 512 and 1 × 1 × 10, respectively. The kernel size of the first and second convolutional layers was 1 × 1 and 3 × 3 pixels, and the ReLU activation function was applied to both convolutional layers. The classification result of 10 distracted behaviors is obtained in the softmax layer. In the training process, the whole network, including the spatial stream ConvNet and the temporal stream ConvNet with the pre-trained weights and the fusion network, was trained by the self-compiled dataset. Note that all layers of the spatial stream ConvNet and the temporal stream ConvNet were fine-tuned. A higher dropout value of 0.9 was set in the fully connected layers to reduce interdependent learning amongst the neurons, which is an efficient way of coping with overfitting.

Experiment Results
We evaluated the performance of the proposed driver distraction detection system in the self-compiled dataset. The experimental set-up is firstly introduced, including the dataset and the computing environment. Then a performance comparison with other existing systems is performed.

Dataset and Experimental Setting
A limited set of distraction classes were included in earlier datasets and most datasets were not public. The dataset of StateFarm on Kaggle [38] was the first publicly available dataset containing 10 distraction classes, used for competition purposes. The AUC Distracted Driver dataset [24], shown in Figure 5, is similar to the StateFarm dataset with the same 10 distraction classes. However, the sampling rate is low, and the motion information between consecutive frames is unstable. Hence, in this study, we referred to the StateFarm and AUC Distracted Driver datasets to compile our own dataset as shown in Figure 6. Ten driver distraction classes were also defined [24,38], including safe driving, text right, phone right, text left, phone left, adjusting radio, drinking, reaching behind, hair or makeup, and talking to passenger. We had 20 participants and the videos were shot in the same car. The videos were divided into 198 films, 28,779 RGB images, and 57,558 optical flow images of horizontal and vertical components. All images were resized to 344 × 356 pixels before the training and test process.  In the training process, the whole network, including the spatial stream ConvNet and the temporal stream ConvNet with the pre-trained weights and the fusion network, was trained by the self-compiled dataset. Note that all layers of the spatial stream ConvNet and the temporal stream ConvNet were fine-tuned. A higher dropout value of 0.9 was set in the fully connected layers to reduce interdependent learning amongst the neurons, which is an efficient way of coping with overfitting.

Experiment Results
We evaluated the performance of the proposed driver distraction detection system in the selfcompiled dataset. The experimental set-up is firstly introduced, including the dataset and the computing environment. Then a performance comparison with other existing systems is performed.

Dataset and Experimental Setting
A limited set of distraction classes were included in earlier datasets and most datasets were not public. The dataset of StateFarm on Kaggle [38] was the first publicly available dataset containing 10 distraction classes, used for competition purposes. The AUC Distracted Driver dataset [24], shown in Figure 5, is similar to the StateFarm dataset with the same 10 distraction classes. However, the sampling rate is low, and the motion information between consecutive frames is unstable. Hence, in this study, we referred to the StateFarm and AUC Distracted Driver datasets to compile our own dataset as shown in Figure 6. Ten driver distraction classes were also defined [24,38], including safe driving, text right, phone right, text left, phone left, adjusting radio, drinking, reaching behind, hair or makeup, and talking to passenger. We had 20 participants and the videos were shot in the same car. The videos were divided into 198 films, 28,779 RGB images, and 57,558 optical flow images of horizontal and vertical components. All images were resized to 344 × 356 pixels before the training and test process.  Due to the limited amount of data in our dataset, the pre-trained model on UCF-101 [53] was used for network initialization, and then all the layers of spatial and temporal ConvNet streams were fine-tuned. In order to reduce the training time, videos of the 10 actions with large hand movements were selected from the UCF-101 dataset. All experiments were performed on a personal computer (PC) with an NVIDIA GeForce GTX TITAN X graphics processing unit (GPU) having 3072 CUDA cores with 12 GB of random-access memory (RAM), using an Intel i7-6700K central processing unit CPU and 16 GB of RAM; the frameworks of Keras and Tensorflow were used.

Comparison with Existing Systems
We firstly performed experiments on the AUC Distracted Driver Dataset to evaluate the accuracy of the spatial network. As with the train/test data split proposed in Reference [17], it was found that there was a high correlation between training and test data. In other words, similar images of a person performing one distraction behavior would be in the training and test data. In order to avoid this situation, we re-split the AUC dataset according to the drivers' identifier (ID), and the IDs used in training and test data were separated. The experimental results were compared with Reference [25] as shown in Table 1. "Original Split" means the original data split used in Reference [25], and "Re-Split" means the data was re-split according to the drivers' IDs. The study applying VGG-16 with regularization was compared. Note that we only evaluated the performance of the spatial stream ConvNet because, in the AUC Distracted Driver Dataset, the sampling rate was low which resulted in discontinuous motion. This would cause very large errors for motion estimation, and the proposed temporal stream ConvNet was not suitable for this case. According to the results, it was found that the data split affected the performance much. The results of using different data splits were quite different. The regularization methodology could increase the accuracy rate by about 1% to 2%. Re-Split 77.15 Since the displacement of the objects between two consecutive frames is large in the AUC Distracted Driver Dataset, it might cause errors in the process of motion estimation. We compiled our distraction dataset for performance evaluation. In the training process, stochastic gradient descent was used for an optimization learning rate of 0.0001, decay rate of 10 −6 , and momentum value of 0.9.  Due to the limited amount of data in our dataset, the pre-trained model on UCF-101 [53] was used for network initialization, and then all the layers of spatial and temporal ConvNet streams were fine-tuned. In order to reduce the training time, videos of the 10 actions with large hand movements were selected from the UCF-101 dataset. All experiments were performed on a personal computer (PC) with an NVIDIA GeForce GTX TITAN X graphics processing unit (GPU) having 3072 CUDA cores with 12 GB of random-access memory (RAM), using an Intel i7-6700K central processing unit CPU and 16 GB of RAM; the frameworks of Keras and Tensorflow were used.

Comparison with Existing Systems
We firstly performed experiments on the AUC Distracted Driver Dataset to evaluate the accuracy of the spatial network. As with the train/test data split proposed in Reference [17], it was found that there was a high correlation between training and test data. In other words, similar images of a person performing one distraction behavior would be in the training and test data. In order to avoid this situation, we re-split the AUC dataset according to the drivers' identifier (ID), and the IDs used in training and test data were separated. The experimental results were compared with Reference [25] as shown in Table 1. "Original Split" means the original data split used in Reference [25], and "Re-Split" means the data was re-split according to the drivers' IDs. The study applying VGG-16 with regularization was compared. Note that we only evaluated the performance of the spatial stream ConvNet because, in the AUC Distracted Driver Dataset, the sampling rate was low which resulted in discontinuous motion. This would cause very large errors for motion estimation, and the proposed temporal stream ConvNet was not suitable for this case. According to the results, it was found that the data split affected the performance much. The results of using different data splits were quite different. The regularization methodology could increase the accuracy rate by about 1% to 2%. Since the displacement of the objects between two consecutive frames is large in the AUC Distracted Driver Dataset, it might cause errors in the process of motion estimation. We compiled our distraction dataset for performance evaluation. In the training process, stochastic gradient descent was used for an optimization learning rate of 0.0001, decay rate of 10 −6 , and momentum value of 0.9. The batch size and number of epochs were set to 16 and 500 respectively. We compared the results with the two-stream method [27], and the results are shown in Table 2. Firstly, it was found that the two-stream method [27] and the proposed method indeed increased the accuracy rate of the pre-trained model, especially for the temporal stream ConvNet in the proposed method. Secondly, fusion results were better than using either spatial or temporal stream ConvNet alone. Hence, fusing information is recommended. In addition, the proposed method increased the accuracy rate by nearly 30% compared with Reference [27].  Table 3 shows the confusion matrix for distraction behavior analysis in the self-compiled dataset. The proposed system provided better results for the actions of text left and phone left, while the action of phone right was easily confused with the action of hair or makeup.

Conclusions
Driver distraction is one of the major causes of traffic accidents, especially in Taiwan. It would be worth developing a system to detect driver distraction for automatic vehicles. Inspired by the famous two-stream CNN model, we proposed a driver behavior analysis system using CNNs to analyze the input consecutive frames, and the feature maps from the last convolution layer were extracted as the spatial and temporal features for further classification. Unlike previous studies using manually defined weights or weights obtained via the optimization process, a fusion network was designed to integrate the modality features for classification. In addition, a self-compiled dataset of 10 actions in the vehicle was established. According to the experimental results, the proposed system can increase the accuracy rate by nearly 30% compared to the original two-stream CNN model. In the future, a comprehensive network configuration could be designed for more fine-grained actions and various scene challenges, such as behavior recognition in the night.