Research on Driver Status Recognition System of Intelligent Vehicle Terminal Based on Deep Learning

Automobile safe-driving technology is an active research topic and is important to the transportation system. Monitoring driving behavior is the foundation and core of safe-driving techniques. Reviewing existing vehicle safety technology not only improves the understanding of current progress in safe-driving research but also provides a reference for future researchers. This paper proposes a state recognition system based on a three-dimensional convolutional neural network that identifies several improper states drivers frequently exhibit while driving, including drinking, making phone calls, and smoking, and that can issue alarm interventions. The system takes the collected continuous video frames as the input of the three-dimensional convolutional network, carries out multi-level feature extraction and spatio-temporal information fusion, and identifies the driver state from the extracted spatio-temporal features. The state is judged from the facial feature points in the video stream, which completes the design of the video-surveillance driver state recognition system. The driver status recognition is then improved and optimized, and finally the system is deployed on a mobile terminal. Extensive experimental results show that the proposed driver status recognition system achieves high identification accuracy.


Introduction
The rapid growth in car ownership over the past few years has led to a corresponding increase in traffic accidents. According to the World Health Organization (WHO), traffic accidents are among the top 10 causes of death globally [1]. Volvo's accident report shows that nearly 90% of road traffic accidents are caused by human error. Advanced driver assistance systems are considered to be an effective solution to reduce human error and the corresponding traffic accident rate.
Several studies on monitoring driving behavior have been conducted in recent years. So far, physiological detection of drivers has mainly relied on EEG signals [2], ECG signals [3], and EMG signals [4,5]. These methods offer strong real-time performance and reliable results. However, they require electrodes to be attached to the driver, which drivers are reluctant to accept and which interferes with normal driving, so they are difficult to promote in practice. By contrast, the electrical signals of the eye [6] are easier to collect than EEG and are less affected by slight noise, but they still have to be collected in real time through a head-worn device. The video-based driver state recognition system proposed in this paper captures the driver through the vehicle camera and extracts feature information [7] from the video using a deep learning algorithm [8]. The system then identifies the driver state [9] from this feature information, detects abnormal states in time, and issues an alarm. The proposed system does not interfere with normal driving, which makes it practical and easy to popularize.
In the field of target detection [10], deep learning [11,12,13] has great advantages, which has also promoted the study of fatigue driving detection. Zhu et al. proposed a fatigue detection regression model based on EOG. The model uses a convolutional neural network (CNN) [14,15] for unsupervised feature learning, which replaces manually designed feature extraction [16,17]. In addition, the linear dynamic system (LDS) [18,19] algorithm is used in post-processing to greatly reduce unwanted interference. In 2017, Zeng et al. proposed a super-resolution reconstruction method that improves the convolutional neural network and applies it to a single image. It is a new network that combines a dense residual network and a deconvolutional network [20,21], which makes image reconstruction easier when processing single images at multiple levels. Compared with the classic super-resolution reconstruction method, this algorithm handles edge sharpness and the sharpness of the reconstructed image better, and it improves the peak signal-to-noise ratio and thus the overall quality of the reconstructed image. However, its handling of edge detail in the reconstructed image is still not good enough, so it is mainly suited to the reconstruction of multiple small scenes [22]. In the field of image vision, the visual features learned by a deep network are more robust across different scenes than traditional manually designed features [23], and they give higher prediction accuracy when conditions change in multiple ways.
This paper classifies the driver's abnormal behaviors into the following categories: drinking, making phone calls, and smoking. Relevant public data sets were collected from the Internet and preprocessed. Videos were divided into consecutive frames, 15 frames were taken per second, and unrelated video frames were removed. Deep learning techniques were used to design the driver state recognition system for video monitoring [24], and the deep learning framework PyTorch [25,26] was used to train the model on the prepared data sets. In addition, the driver state recognition technique was improved and optimized to increase both the accuracy and the recognition speed of the model. Finally, the driver status recognition system was deployed on the Jetson Nano development board.
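The frame sampling step can be sketched as follows. This is a minimal illustration, assuming the source video has a known, constant frame rate; the function name `sample_frame_indices` and the 30 fps example are hypothetical and are not taken from the original pipeline.

```python
def sample_frame_indices(total_frames, source_fps, target_fps=15):
    """Return the indices of the frames to keep so that roughly
    target_fps frames are retained per second of video."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        # Nothing to drop: the video is already at or below the target rate.
        return list(range(total_frames))
    step = source_fps / target_fps  # source frames per kept frame
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# Two seconds of hypothetical 30 fps video: every second frame is kept.
kept = sample_frame_indices(total_frames=60, source_fps=30)
print(len(kept), kept[:5])
```

Frames judged unrelated to the driver's behavior would still have to be removed afterwards; that filtering is not shown here.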

Convolutional Neural Network
A neural network is a mathematical model of distributed information processing that is inspired by animal nervous systems. It was first proposed by the psychologist W. McCulloch and the mathematical logician W. Pitts, and is still in use today. The most basic component of the neural network is the neuron model [27], which is shown in Figure 1. Each circle in Figure 1 represents a neuron, and each line represents a connection between neurons. The neurons are arranged in layers, with links between adjacent layers and none within the same layer [28]. It can be seen from the figure that the convolutional neural network follows a process similar to the traditional classification method. The difference is that the convolutional neural network does not need manually extracted features; it learns the relevant features automatically through convolution operations and then classifies them.
In a biological neural network, the neuron corresponds to the perceptron of an artificial neural network. The perceptron is composed of inputs, weights, an activation function, and an output. A perceptron can have more than one input x, each with a weight w. There are several options for the activation function f, the most common of which is given in Equation (1). The output of the model is given in Equation (2), $y = f\left(\sum_i w_i x_i + b\right)$, where w is the weight, x is the input, y is the output, and b is the offset. In 1986, Hinton proposed backpropagation, in which new weights and other parameters are obtained by minimizing the error and the parameters of the whole network are then updated. A learning rate λ (a hyperparameter) is specified; multiplying the rate of change of the error by the learning rate gives the amount by which each weight and bias term changes after a training step, and this update is applied before the next step. This theory was responsible for an upsurge in the study of neural networks.
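The perceptron output described above can be sketched in a few lines. The sigmoid activation is an assumption chosen for illustration, since the text does not fix which activation function is meant, and the weights and inputs are made-up values.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here purely for illustration."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the offset, passed through the activation."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical values: the weighted sum is 0.5*1.0 - 0.25*2.0 + 0.0 = 0.0.
y = perceptron([1.0, 2.0], [0.5, -0.25], bias=0.0)
print(y)  # sigmoid(0.0) is 0.5
```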
In recent years, convolutional neural networks have been applied more and more widely in image processing. Because handwritten text is abundant and highly varied, the accuracy of ordinary machine recognition is very low; convolutional neural networks address this problem and greatly improve the accuracy. Many methods of unlocking mobile phones have appeared, but in recent years face recognition has become the most popular one, which is also inseparable from the development of the convolutional neural network: the most important face recognition algorithms are based on CNNs. With the YOLOv3 model, image classification and recognition become more accurate, and targets of interest can be identified automatically, which plays an important role in both military and civilian fields.
Neural networks are composed of neural units with learnable weights and biases. Each neural unit computes its output from its inputs, a process called forward propagation, according to the formulas of the neural network [29], as shown in Equation (3), where w and b are the parameters to be trained, x is the input, and s is the output. Comparing the output with the sample output gives the error value, as shown in Equation (4), where d is the output and y is the truth value. The error computed by the loss function is propagated back layer by layer through the hidden layers, and the weights are updated repeatedly; this is backpropagation, as shown in Equation (5), where E is the error and w is the weight. The process continues until the error converges and meets the accuracy requirement.
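The cycle of forward propagation, error computation, and weight update can be illustrated on a single unit. The sketch below assumes a linear unit $s = wx + b$, a squared error $E = \frac{1}{2}(s - y)^2$, and the gradient descent update $w \leftarrow w - \lambda \, \partial E / \partial w$; these specific forms, the function name, and the toy data are assumptions made for illustration and may differ from the equations used in the paper.

```python
def train_linear_neuron(samples, learning_rate=0.1, epochs=200):
    """Fit s = w*x + b to (x, target) pairs by per-sample gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            prediction = w * x + b            # forward propagation
            error = prediction - target       # dE/dprediction for E = 0.5 * error**2
            w -= learning_rate * error * x    # dE/dw = error * x
            b -= learning_rate * error        # dE/db = error
    return w, b


# Toy data generated from target = 2*x + 1, so training should approach w = 2, b = 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear_neuron(samples)
print(round(w, 3), round(b, 3))
```

In a multi-layer network the same update is applied to every weight, with the gradient obtained by propagating the error backwards through the layers.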
The deep learning model is a multi-layer feature description method constructed from convolutional layers and hidden layers. Deep learning models mainly include the convolutional neural network (CNN), the deep Boltzmann machine (DBM), the restricted Boltzmann machine (RBM), and other models [30].
The convolutional neural network has a characteristic structure: a data layer, convolutional layers, pooling layers, activation layers, fully connected layers, and an output layer [31]. Among them, the convolutional layer is the core of the convolutional neural network. It performs the convolution operation on the image: the filter is placed over a position in the image, each value in the filter is multiplied by the value of the corresponding pixel, and the products are summed to give the value of the target pixel in the output image. This is repeated for all locations of the image. The activation layer is realized by the activation function, which adds nonlinearity to the convolutional neural network and enables it to approximate arbitrary nonlinear functions. Common activation functions include the Sigmoid, ReLU, and Tanh functions.
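The sliding-filter computation of the convolutional layer can be written out directly. The following is a minimal sketch, assuming no padding and a stride of 1; as in most deep learning frameworks, the filter is applied without flipping. The image and kernel values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output value is the sum of the
    element-wise products between the kernel and the patch it covers."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    output = []
    for r in range(image_h - kernel_h + 1):
        row = []
        for c in range(image_w - kernel_w + 1):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 0, 1]]
kernel = [[1, 0],
          [0, 1]]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[2, 5], [0, 2]]
```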
The pooling layer generally uses max pooling, which takes the maximum value of each local area. Because the maximum of a local area tends to stay the same after small changes in position or scale, max pooling gives the feature map a degree of invariance. In addition, the pooling layer reduces the dimension of the features, reduces the amount of calculation, and speeds up inference.
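Max pooling can be sketched as follows, assuming non-overlapping square windows (a window size of 2 is used in the example). Rows or columns that do not fill a complete window are dropped in this simplified version.

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]
pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4 × 4 input becomes a 2 × 2 output, which is the reduction in feature dimension mentioned above.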

3D Convolutional Neural Network
The convolutional networks used for image recognition are generally 2D convolutional neural networks, whereas this paper focuses on video recognition, which carries one more dimension of information (time) than a single image. Therefore, this paper selects a 3D convolutional neural network for this study [32].
Compared with the 2D convolutional neural network, the 3D convolutional neural network is more suitable for driver state recognition. Through 3D convolution and 3D pooling, the 3D convolutional neural network can model both temporal and spatial information. In a 3D convolutional neural network, convolution and pooling operate simultaneously in time and space, whereas in a 2D convolutional neural network they operate only in space, so the time axis plays no part in the convolution. Compared with traditional low-dimensional convolution, 3D convolution is therefore more suitable for recognizing continuous motion states that span multiple frames. Figure 2 compares the principles of a 2D convolutional neural network and a 3D convolutional neural network. Whether it processes a single image or a video stream, the 2D convolutional neural network outputs a single image. When a video stream is convolved into a single image, the time information is lost, and the information along the video time axis cannot be fused. In the 3D convolutional neural network, by contrast, the output for a video stream is a complete 3D feature map that contains both temporal and spatial information. Therefore, the 3D convolutional neural network is suitable for the driver state recognition in surveillance video studied in this paper.
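The way a 3D convolution retains the time axis can be seen in a small sketch. It assumes no padding and a stride of 1, with a clip indexed as `clip[t][row][col]`; the clip and the temporal-difference kernel are made-up values chosen so that the output responds only to change between frames.

```python
def conv3d(clip, kernel):
    """Slide a kernel over time, height, and width at once."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        [
            [
                sum(
                    kernel[a][i][j] * clip[t + a][r + i][c + j]
                    for a in range(kt)
                    for i in range(kh)
                    for j in range(kw)
                )
                for c in range(W - kw + 1)
            ]
            for r in range(H - kh + 1)
        ]
        for t in range(T - kt + 1)
    ]


# Four 3x3 frames whose brightness rises by 1 per frame.
clip = [[[t] * 3 for _ in range(3)] for t in range(4)]
# A 2x2x2 kernel that subtracts the earlier frame from the later one.
kernel = [[[-1, -1], [-1, -1]],
          [[1, 1], [1, 1]]]
features = conv3d(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 3 2 2
print(features[0][0][0])  # 4: the summed frame-to-frame change over a 2x2 patch
```

The output still has a time axis of length 3, whereas a 2D convolution applied to one frame at a time would have no term that relates one frame to the next.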
A C3D convolutional neural network model is selected in this paper. This model is a 3D convolutional neural network for behavior recognition in video. It has a simple structure, can fully extract the temporal and spatial information of video, and is relatively efficient while maintaining accuracy. The C3D convolutional neural network consists of 7 parts. The first and second parts each consist of a convolutional layer and a pooling layer. The third to fifth parts each consist of two convolutional layers and one pooling layer. The sixth part is two fully connected layers. The seventh part is the Softmax layer [33]. The Softmax function is shown in Equation (6). The C3D convolutional neural network is shown in Figure 3.
Figure 3. C3D convolutional neural network structure. This architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 fully connected layer. H is the hardwired layer, C is a convolution layer, S is a subsampling layer, and 7@60 × 40 represents 7 continuous frames of 60 × 40.
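The Softmax layer converts the final scores into class probabilities. Its standard definition is $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$, which the sketch below implements. The three scores are made-up values, and mapping them to drinking, phoning, and smoking is only an illustrative assumption.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three classes (e.g. drinking, phoning, smoking).
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print([round(p, 3) for p in probs], predicted_class)
```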

YOLOv3
The YOLOv3 model, like the YOLOv1 and YOLOv2 networks, belongs to the end-to-end YOLO series. The YOLO network uses full-image information to make predictions. Unlike the sliding-window and region-proposal-based methods, the YOLO network makes full use of the whole image during training and prediction and can learn generalized information about the target, which gives it a certain universality. Compared with the previous YOLO series network models, the YOLOv3 network model mainly achieves a better trade-off between detection speed and accuracy. Experiments show that on a Tesla V100, real-time detection on the MS COCO data set reaches 65 FPS with an accuracy of 43.5% AP [34].
YOLOv3 is an efficient and powerful target detection network [35]. Existing papers have verified many advanced techniques that affect target detection performance, and current detection methods have been improved to be more effective and better suited to single-GPU training. These improvements include CBN, PAN, SAM, etc.
YOLOv3 divides the image into S × S grids, and the grid cell that contains the center of a target is responsible for predicting that target. To detect C classes of targets, each grid cell predicts B bounding boxes and P conditional class probabilities (P = C), and outputs confidence information describing whether each bounding box contains a target and how accurate the box is. The confidence corresponding to each bounding box is calculated as
$$\mathrm{Conf} = P(o) \times I_{pred}^{truth}$$
where o is the detected target, P(o) is the probability that the grid cell contains the detected target, and $I_{pred}^{truth}$ is the intersection over union (IOU) of the predicted bounding box and the true bounding box. If the grid cell contains the target, that is, the center of the target falls within the cell, P(o) is 1; otherwise it is 0. The class confidence of each bounding box is the product of the confidence of the bounding box and the conditional class probability:
$$P(c_l \mid o) \times P(o) \times I_{pred}^{truth} = P(c_l) \times I_{pred}^{truth}$$
where $c_l$ is the class of the detected target and l is the class index, l = 1, 2, . . . , C. YOLO combines the two stages of candidate-area generation and object recognition into one, so which objects are present and where they are can be determined in a single pass. YOLO does not actually eliminate candidate areas; instead, it uses predefined candidate areas.
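The confidence computation can be illustrated with a short sketch. It assumes boxes given as `(x_min, y_min, x_max, y_max)` corner coordinates; the function names and example values are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def class_confidences(p_object, iou_value, class_probs):
    """Box confidence (P(o) * IOU) scaled by each conditional class probability."""
    box_confidence = p_object * iou_value
    return [box_confidence * p for p in class_probs]


predicted = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, truth)  # intersection 1, union 7
print(round(overlap, 4))
print(class_confidences(1.0, 0.5, [0.2, 0.8]))
```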
YOLO first used the ImageNet data set to pre-train the first 20 convolutional layers, and then used the complete network to train and predict object recognition and location on the Pascal VOC data set. The network structure of YOLO is shown in the accompanying figure. The last layer of YOLO uses a linear activation function, while the remaining layers use Leaky ReLU. Dropout and data augmentation are used in training to prevent overfitting.
After the neural network structure is determined, the training effect is determined by the loss function and the optimizer. YOLO uses the ordinary gradient descent method as the optimizer. Equation (9) is the key to the loss function of YOLO:
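For reference, the sum-of-squares loss of the original YOLO formulation is commonly written as shown below. The symbols $\lambda_{coord}$, $\lambda_{noobj}$, the indicator $\mathbb{1}_{ij}^{obj}$ (box j of cell i is responsible for a target), the box parameters $(x, y, w, h)$, the confidence $C_i$, and the class probability $p_i(c)$ come from that formulation; that this is the exact form of Equation (9) is an assumption.

```latex
\begin{aligned}
L ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
       \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
  &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
       \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
            + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
  &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2
   + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
       \left( C_i - \hat{C}_i \right)^2 \\
  &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \mathrm{classes}}
       \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The first two terms penalize localization error, the next two penalize confidence error for boxes with and without a target, and the last term penalizes classification error.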

Advantages of YOLOv3
Compared with the previous networks, YOLOv3 has a better backbone network (similar to ResNet) with better accuracy. Notably, YOLOv3 predicts three boxes in each cell at each of three detection scales, and each box has five basic parameters. For a 416 × 416 picture, there are therefore 845 bounding boxes in v2 (13 × 13 × 5) and 10,647 in v3 ((13 × 13 + 26 × 26 + 52 × 52) × 3). In the cost function, YOLOv3 makes a modification: it does not use softmax (the softmax layer assumes that an image or an object belongs to only one category), but uses a logistic regression layer to classify each category, mainly using the sigmoid function, whose output is constrained to the range 0 to 1. When the sigmoid output for a category after feature extraction is greater than 0.5, the object is taken to belong to that category, so a single box can predict multiple categories. In addition, a comparison of the various versions shows that the performance of YOLOv3 is already sufficient to meet the needs of this experiment. In terms of hardware, although v4 and v5 have better performance, they require a larger investment than v3. In summary, YOLOv3 is the most cost-effective technique and is more suitable for this experiment.
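Both the box counts and the sigmoid-based multi-label decision can be checked with a short sketch. The strides (32 for YOLOv2; 32, 16, and 8 for YOLOv3) and the 5 anchors per cell in YOLOv2 are taken from the published architectures rather than from this text, and the logits are made-up values.

```python
import math


def yolo_v2_box_count(input_size=416, stride=32, boxes_per_cell=5):
    """One detection scale: (input_size / stride)^2 cells, 5 boxes per cell."""
    return (input_size // stride) ** 2 * boxes_per_cell


def yolo_v3_box_count(input_size=416, strides=(32, 16, 8), boxes_per_cell=3):
    """Three detection scales, 3 boxes per cell at each scale."""
    return sum((input_size // s) ** 2 for s in strides) * boxes_per_cell


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multi_label_classes(logits, threshold=0.5):
    """Each class is decided independently, so several classes can be active."""
    return [i for i, z in enumerate(logits) if sigmoid(z) > threshold]


print(yolo_v2_box_count())   # 845
print(yolo_v3_box_count())   # 10647
print(multi_label_classes([2.0, -1.0, 0.3]))  # [0, 2]
```

Unlike softmax, the per-class sigmoid outputs do not have to sum to 1, which is what allows one box to carry more than one label.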

Experiments
In this paper, the driver's abnormal behavior is divided into the following categories: drinking, making phone calls, and smoking. The framework of the driver status recognition system is shown in Figure 5.

Experimental Environment Construction
This experiment was conducted on a Linux system using the PyTorch framework. We then ported the trained network to a Jetson Nano B01 development board and tested driver state recognition using the attached camera. We placed our equipment in the car without obstructing the driver's sight; the configuration of the vehicle is shown in Figure 6. NVIDIA released the Jetson Nano development kit at the NVIDIA GPU Technology Conference in 2019. It has excellent image processing ability and integrated CUDA support. The Jetson Nano uses a quad-core 64-bit ARM CPU and a 128-core integrated NVIDIA GPU, which provide 472 GFLOPS of computing performance. It has 4 GB of LPDDR4 memory in an efficient, low-power package with 5 W/10 W power modes and a 5 V DC input [36]. Porting the system to the Jetson Nano lets it benefit from the performance of the board while remaining lightweight and portable, with rich peripheral resources. The arrangement of the experimental equipment is shown in Figure 7. The Jetson Nano provides real-time computer vision and inference for a variety of complex deep neural network (DNN) models, and it can serve as an intelligent edge device in Internet of Things systems. Transfer learning is also possible: a network can be retrained locally on the development kit using an ML framework.

Data Set Production
Training a deep model requires a lot of data, and the samples should not be too similar to one another. In this study, the data set consisted of open-source pictures from the Internet and photos taken by ourselves.
This design uses YOLO Mark to label pictures. YOLO Mark is YOLO's data set labeling software, which is very convenient.
The prepared picture set should be divided into a training set, a validation set, and a test set. The training set contains the data samples used to fit the parameters of the model. The validation set is a set of samples held out during model training and is used to tune the hyperparameters of the model. Different combinations of hyperparameters correspond to different candidate models, so what is evaluated on the validation set is in effect a collection of models. The validation set exists to find the best-performing model among these candidates. It can be used to adjust the hyperparameters of the model and to make a preliminary assessment of the model's capabilities.
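A three-way split can be sketched as follows. The 70/15/15 proportions, the fixed random seed, and the file names are assumptions made for illustration; the text does not state the proportions actually used.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle once, then cut into training, validation, and test subsets."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the labeled pictures.
names = [f"img_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before cutting keeps pictures from the same source from clustering in one subset, although near-duplicate frames from one video may still need to be kept together by hand.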
For the neural network, we used the validation set to find the optimal network depth, to determine the stopping point of the backpropagation algorithm, or to select the number of hidden-layer neurons.
Cross-validation, which is commonly used in ordinary machine learning, repeatedly subdivides the training data set itself into different training and validation subsets. The test set is used to evaluate the generalization ability of the final model. It must not be used as a basis for algorithm-related choices such as parameter tuning or feature selection.
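The subdivision used by k-fold cross-validation can be sketched as index lists. This is a minimal illustration without shuffling; the function name and the choice of 6 samples and 3 folds are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs in which every
    sample is used for validation exactly once."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((training, validation))
        start += size
    return folds


for training, validation in k_fold_indices(6, 3):
    print(training, validation)
```

The test set stays outside this loop entirely, so it remains an untouched measure of generalization.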

Results
In this paper, PyTorch was the deep learning framework used for algorithm development and experiments. The loss curve in Figure 8 shows that as the amount of training increases, the loss continues to fall and has essentially converged at 35,000, after which it changes little. This shows that the training of the model is effective. Figure 9 shows that the accuracy improves steadily. The accuracy for making a call reaches its peak fastest, followed by drinking water, and smoking is the slowest. Throughout the process, the accuracy for making a call and drinking water was above 90%, while the accuracy for smoking was slightly lower, perhaps because of the way the data set was produced.
This study then applied our model to a real-world scenario. Several volunteers participated in producing the data for this scenario. They performed the prescribed actions, simulating the irregular actions of a driver while driving. The acquisition system installed in the real car was responsible for recording and real-time analysis.
The test results of this experiment are shown in Figure 11, for which several representative pictures were selected. The figure shows that the system analyzed and identified the abnormal behaviors defined in our experiment and accurately localized the objects associated with them.
In this experiment, we recruited three volunteer drivers with different appearances to conduct multiple trials. The experiment was carried out on a closed road, which ensured both its authenticity and the safety of the volunteer drivers. The accuracy for the three scenarios averaged more than 83%, and some results reached 91%. Most tests retained good detection accuracy; the lower results may have been caused by occlusions in the complex real scene.

Conclusions
This paper presents a driver state recognition system based on a 3D convolutional neural network. The model parameters of the system are obtained by iterative learning on a large number of training samples. Extensive experiments show that the recognition accuracy is good, although there is still room for improvement. In practical applications, the system takes the collected continuous video frames as its input and carries out multi-level feature extraction and spatio-temporal information fusion using the learned model parameters and the 3D convolutional neural network. The system recognizes the driver's status from the extracted spatio-temporal features and issues an early warning to intervene when the driver is in a poor state, which helps to ensure travel safety to a certain extent.