Recognition of Drivers’ Activity Based on 1D Convolutional Neural Network

Background and objective: Driving a car is a complex activity which involves movements of the whole body. Many studies on drivers' behavior are conducted to improve road traffic safety. Such studies involve the registration and processing of multiple signals, such as electroencephalography (EEG), electrooculography (EOG) and images of the driver's face. In our research, we attempt to develop a classifier of scenarios related to learning to drive based on data obtained in real road traffic conditions via smart glasses. In our approach, we try to minimize the number of signals needed to recognize the activities performed while driving a car. Material and methods: We evaluate the drivers' activities using electrooculography (EOG) and a deep learning approach. To acquire data, we used JINS MEME smart glasses furnished with 3-point EOG electrodes, a 3-axial accelerometer and a 3-axial gyroscope. Sensor data were acquired from 20 drivers (ten experienced and ten learner drivers) on the same 28.7 km route under real road conditions in southern Poland. The drivers performed several tasks while wearing the smart glasses, and the tasks were linked to the signal during the drive. For the recognition of four activities (parking, driving through a roundabout, city traffic and driving through an intersection), we used a one-dimensional convolutional neural network (1D CNN). Results: The maximum accuracy was 95.6% on the validation set and 99.8% on the training set. The results prove that the model based on a 1D CNN can accurately classify the actions performed by drivers. Conclusions: We have proved the feasibility of recognizing drivers' activity based solely on EOG data, regardless of driving experience and style. Our findings may be useful in the objective assessment of driving skills and, thus, in improving driving safety.


Introduction
Driving a car is a complex activity which involves movements of the whole body [1]. The decisions and behavior of drivers regarding the surrounding traffic are crucial for road safety [2]. The factors which affect road traffic safety can be divided into two categories: environmental factors and the state of the driver. The environmental factors include weather and road conditions. We define the state of the driver as the driver's alertness, concentration (focus), cognitive abilities and whether secondary tasks are being performed.
To solve the problem of specifying drivers' activities, we apply recognition based on tracking eye movements, as most human activities require eyeball movements [3][4][5]. The analysis of eye movements may help understand the reasons for an activity and determine its beginning and end. However, most eye movements are involuntary and remain out of conscious control [6].
The mainstream method for tracking eye movements in human behavior research is analyzing images registered with a camera [7] because of its advantages: small individual differences and non-contact measurement of the eyes. The drawbacks of using a camera to track eye movements are the trade-off between processing time and detection accuracy and the susceptibility to lighting conditions, skin color and sunglasses [8]. Moreover, cameras mounted in vehicles cannot be used to detect the state of a driver outside the vehicle [9]. An alternative method is electrooculography (EOG), a technique for measuring the resting electrical potential between the cornea and retina of the human eye [10]. The EOG signal is registered by electrodes placed around the eyes. The extracted EOG signal is then processed in order to detect the eyeball movements [4,5,11-13].
The reason for choosing EOG to detect eyeball movements was the availability of JINS MEME ES_R smart glasses, which can register electrooculograms in a non-invasive way (without attaching electrodes to the body) while performing various activities, including driving a car [9,14], and the fact that eyeball movements contain the most information about activities related to driving [2,4,15]. Another reason for using only electrooculography was to design a system for recognizing drivers' activities regardless of their style of driving and experience, and to find the minimal number of attributes required to recognize such activities [16].
To the best of our knowledge, the problem of extracting and selecting appropriate features from electrooculograms for the recognition of drivers' behavior has not been thoroughly investigated due to the cumbersome placement of electrodes and the breadth of the topic, which indicates a clear need for further research. The studies conducted by Niwa et al. [9] and Doniec et al. [17] are the only known studies on drivers' behavior which used JINS MEME smart glasses. Another recent study, on recognizing the gaze during left turns, was conducted by Stapel et al. [18].
Because the average accuracy of the classifier based on k-means clustering and BFS reported in [17] was 85% when analyzing four activities related to driving (driving on the motorway, parking, urban traffic, traffic in the neighborhood), we attempted to improve the accuracy by changing the approach to classification.
In our study, we apply a one-dimensional (1D) convolutional neural network (CNN) to perform classification on raw EOG signals without crafting features prior to classification [24]. Another advantage of using 1D CNN is the ability to retrain the model on new data sets by using transfer learning [25].
The purpose of this study was to examine whether it is possible to classify drivers' activities in real road conditions based on raw EOG signals and a 1D CNN. The performance of the 1D CNN built for classification was evaluated in terms of precision, recall and F1-score.
The structure of this paper is as follows: material and methods are described in Section 2, which includes the experiment setup, data preprocessing and classification. The results in Section 3 consist of the loss and accuracy graphs, the confusion matrix, and receiver operating characteristic (ROC) curves of the proposed 1D CNN model. Based on the ROC curves and other performance metrics derived from the confusion matrix, we state that the model was trained without overfitting or underfitting. In Section 4, we conclude the paper by discussing the significance of the results and the advantages and limitations of our approach.

Experiment Setup
The study was conducted in real road conditions in accordance with Chapter 4 of the Act on Vehicle Drivers of the Republic of Poland [26] on two groups of volunteers: ten experienced drivers (aged between 40 and 68) with a minimum of ten years of driving experience and ten learner drivers (aged between 18 and 46) who attended driving lessons at a local driving school. The participants gave consent to participate in the study. The candidates for drivers made a statement on their health in a questionnaire submitted to the driving school. Although candidates for drivers may not reveal their actual health status in such questionnaires, we assumed that the study group did not have any health conditions which might cause a direct driving hazard.
In Poland, the drivers and candidates for drivers are subject to medical examination based on Article 39j of the Act on Road Transportation of the Republic of Poland [27] and Chapter 2 of the Act on Vehicle Drivers of the Republic of Poland [26]. The medical examinations of drivers include the examination of vision, hearing, and balance, the state of cardiovascular and respiratory system, kidneys, nervous system, including epilepsy, obstructive sleep apnea, mental health, symptoms of alcohol abuse, the use of drugs that may affect the ability to drive and other health conditions that may cause a driving hazard.
To avoid distracting the driver during the field study, data were acquired with JINS MEME ES_R smart glasses. The device consists of a three-point electrooculography (EOG) sensor and a six-axis inertial measurement unit (IMU) with a gyroscope and an accelerometer. JINS MEME smart glasses acquire ten channels of data: the acceleration and rotation in the X, Y and Z axes, and four EOG channels: the electric potentials on the right and left electrodes and the vertical and horizontal differences between them. All signals are sampled at a frequency of 100 Hz. The data are transmitted to a computer via Bluetooth or USB and can be exported to a CSV file. Electrooculograms were recorded with the three-point EOG sensor, which consists of left, right and bridge electrodes, and were converted into a four-lead (channel) recording of the EOG signal: left, right, horizontal and vertical [14].
Each participant had to perform the same set of tasks while wearing the smart glasses. Sensor data were acquired and linked to scenarios related to driving by the observer sitting on a back seat (see Figure 1). Each learner driver (learner) drove the same car adapted for driving lessons and marked with the L sign. The learners drove the car under the supervision of a driving instructor, whereas other drivers drove their own cars. Sensor data from learner drivers were obtained thanks to the cooperation with a local driving school.
All participants completed their tasks on the same route of 28.7 km through Tarnowskie Góry, Radzionków, Bytom, and Piekary Śląskie in southern Poland, presented in Figure 2. The tasks to be completed during the drive were based on the regulations on practical driving tests [28] and included:
• a journey on a motorway;
• driving straight ahead in city traffic;
• driving straight ahead outside of the urban area;
• driving straight ahead in residential traffic;
• driving through a roundabout (right turn, driving straight ahead and left turn);
• driving through a crossroads (right turn, driving straight ahead and left turn);
• parking (parallel, perpendicular, angled).
The route included roundabouts and parking lots shown as satellite images in Figures 3 and 4. Figure 3 presents the bird's eye view of two roundabouts (small and large) and Figure 4 presents the bird's eye view of public parking spaces with no ticketing along Artura Street in Radzionków (Poland). Each activity was labeled manually by the researcher during the drive. A pilot (the driving instructor in the case of learner drivers, the researcher otherwise) asked the driver to start performing a particular activity, and at the same time, recording began. When the task was completed, the pilot asked the driver to stop. The file with recorded data was named according to the registered activity.
The average time of completing all the tasks during the experiment was 75 min. Experienced drivers generally completed the route faster than learner drivers, regardless of road conditions [17].
The experiment was carried out following the rules of the Declaration of Helsinki of 1975, revised in 2013, and with a permit issued by the Provincial Police Department in Katowice. The participants gave their informed consent for inclusion in the study. The study protocol was approved on 16 October 2018 by the Bioethics Committee of the Medical University of Silesia in Katowice (resolution number KNW/0022/KB1/18). The identity of the learner drivers is confidential under the agreement with the driving school, and the same rule applies to the experienced drivers. Experimental data were made publicly available as Supplementary Materials at IEEE DataPort [29].

Data Preprocessing
The data set consists of 520 labeled recordings of electrooculograms, acceleration and gyration signals from both experienced and inexperienced drivers acquired with JINS MEME ES_R smart glasses and is available at the IEEE DataPort [29]. The reason for analyzing the recordings acquired from both experienced and inexperienced drivers is that there are no significant differences in their overall cognitive and motor skills [30].
The recordings were divided into four scenarios (categories): parking, driving through a roundabout, driving in city traffic and driving through an intersection, chosen based on the classification accuracy reported in [17]. By considering only EOG signals, we can distinguish specific patterns associated with the analyzed activities, regardless of the style of driving, driving dynamics and experience, which are visible in the acceleration and gyration signals.
The recordings were divided into the categories as follows:
• parking: 120 recordings,
• driving through a roundabout: 120 recordings,
• driving in city traffic: 160 recordings,
• driving through an intersection: 120 recordings.
Each of these activities can be further divided into the categories described in Section 2.1. We focused on recognizing four activities to verify the feasibility of classification based on a 1D CNN.
The data were preprocessed before classification in Python 3.7.5 with Pandas, NumPy, Matplotlib and Scikit-learn. The first step was to unify the length of the signals, because the length of the original signal vectors varied from 320 to 32,357 samples across all the categories. To unify the input signals, we took the maximum length across all the recordings and tiled the shorter vectors to match that value. This approach turned out to be the easiest and fastest way to standardize the input data without losing the inherent characteristics of the signal for each category.
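A minimal NumPy sketch of this tiling step (the array sizes and helper name are illustrative, not the actual recordings):

```python
import numpy as np

def tile_to_length(signal: np.ndarray, target_len: int) -> np.ndarray:
    """Repeat a 1D signal until it covers target_len samples, then truncate."""
    reps = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, reps)[:target_len]

# Example: two recordings of different lengths unified to the longest one.
recordings = [np.arange(5, dtype=float), np.arange(8, dtype=float)]
max_len = max(len(r) for r in recordings)
unified = np.stack([tile_to_length(r, max_len) for r in recordings])
print(unified.shape)  # (2, 8)
```

Tiling (rather than zero-padding) keeps the shorter vectors filled with signal that has the category's own characteristics, which is the property the text emphasizes.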
The second step of preprocessing was normalizing the input values to prevent the exploding/vanishing gradient problem [31]. After normalization, the data were divided into training and validation sets. The best performance was achieved with a validation split of 20%. The number of samples in both sets and for each category is shown in Figures 5 and 6.
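The two steps can be sketched in plain NumPy as follows. The data shapes are hypothetical, and min-max scaling is an assumption on our part, since the exact normalization used is not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 520 recordings, each a fixed-length 4-channel EOG vector.
X = rng.normal(size=(520, 100, 4))
y = rng.integers(0, 4, size=520)  # four activity labels

# Min-max normalization per channel keeps inputs in [0, 1],
# which helps avoid exploding/vanishing gradients.
X_min = X.min(axis=(0, 1), keepdims=True)
X_max = X.max(axis=(0, 1), keepdims=True)
X_norm = (X - X_min) / (X_max - X_min)

# 80/20 train/validation split on shuffled indices.
idx = rng.permutation(len(X_norm))
split = int(0.8 * len(X_norm))
train_idx, val_idx = idx[:split], idx[split:]
X_train, X_val = X_norm[train_idx], X_norm[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
print(X_train.shape, X_val.shape)  # (416, 100, 4) (104, 100, 4)
```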

Classification
In this study, we propose an approach to driver activity recognition using a one-dimensional convolutional neural network (1D CNN). This neural network model has proven its effectiveness in signal classification, yielding state-of-the-art results [32,33]. Because biological signals have non-linear characteristics, convolutional neural networks are an adequate choice, as they are developed precisely for recognizing non-linear patterns in data [34]. Considering that no patterns in EOG signals related to drivers' activity have been established, we applied a 1D CNN due to its ability to extract features automatically. By using convolutional layers, we can also visualize the set of filters after training and try to learn which characteristics of the input signal are related to a certain activity.
The architecture of the proposed model is shown in Figure 7. It consists of the following elements:
• Convolution with max pooling blocks: the first convolution layer produces 64 feature maps, which are processed by an activation function to capture non-linear patterns, followed by a pooling layer with a kernel size of two to reduce the extracted information. The second convolution layer generates 32 feature maps with a kernel of size three (as in the first block), again followed by a rectified linear unit (ReLU) and a pooling layer [35]. Although the kernel size of a convolutional layer may be much larger in a 1D CNN than in its two-dimensional (2D) counterpart, the best results were achieved with the smaller kernel.
• Dropout layer: the dropout rate was set to 0.5. This layer turned out to be a key element because it prevents overfitting at the beginning of the training phase [36].
• Dense and flatten blocks: after the second convolution-pooling block, the feature maps are flattened into their one-dimensional representation and classified by a single final layer of four neurons followed by a softmax activation function.
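The forward pass through these blocks can be sketched in plain NumPy with untrained random weights (the 100-sample input length and channel count are illustrative only; the dropout layer is active only during training and is therefore omitted from this inference-time sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """Valid 1D convolution: x is (T, C_in), w is (k, C_in, C_out)."""
    k = w.shape[0]
    return np.stack([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - k + 1)])

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    T = (x.shape[0] // size) * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)

# Toy input: one recording of 100 samples, 4 EOG channels.
x = rng.normal(size=(100, 4))

# Block 1: 64 feature maps, kernel 3, ReLU, pooling with kernel 2.
h = max_pool(relu(conv1d(x, rng.normal(size=(3, 4, 64)))))
# Block 2: 32 feature maps, kernel 3, ReLU, pooling with kernel 2.
h = max_pool(relu(conv1d(h, rng.normal(size=(3, 64, 32)))))

# Flatten, then a dense layer of four neurons with softmax.
flat = h.reshape(-1)
logits = flat @ rng.normal(size=(flat.size, 4))
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(h.shape, probs.shape)  # (23, 32) (4,)
```

With a 100-sample input, the two valid convolutions and poolings shrink the time axis from 100 to 23, and the softmax output is a probability over the four activity classes.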
The training process lasted 100 epochs, the batch size was 20 and the decaying learning rate started at 0.001. The 1D CNN model was developed in Python 3.7.5 using the Keras library with TensorFlow version 2.1 as the backend. To accelerate the computation, we set up GPU (graphics processing unit) support with Nvidia CUDA (Compute Unified Device Architecture) version 10.1.
Data preprocessing and classification of the 520 labeled recordings were run on an Nvidia GTX 1060 with 6 GB of VRAM (video RAM), and training the proposed model for 100 epochs took circa 10-15 min. The source code of the classifier was made publicly available at IEEE DataPort [29].

Results
This section provides and describes the results of classification of four analyzed driving scenarios registered in 520 labeled recordings using 1D CNN.
The accuracy was 99.8% on the training set and 95.6% on the validation set. The performance of the training process is presented as the learning curves and the decaying learning rate curve in Figures 8-10.
The accuracy curve shows the correctness of the model's performance across the epochs. Both training and validation accuracy reached high values (above 90%) after circa 40 epochs.
The loss is the sum of errors made in each epoch. After circa 40 iterations, the loss stabilizes (circa 0 for the training set and below 0.2 for the validation set). Figure 10 shows how the learning rate changed across iterations. In this case, the learning rate was halved after five epochs if the validation loss did not decrease.
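The halving rule corresponds to a reduce-on-plateau schedule; a minimal stand-alone sketch with hypothetical loss values (the function name is ours, not from the source code):

```python
def plateau_schedule(val_losses, lr=0.001, patience=5, factor=0.5):
    """Halve the learning rate whenever validation loss has not
    improved for `patience` consecutive epochs."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# Loss stops improving after the third epoch, so after five more
# epochs without improvement the rate is halved to 0.0005.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
print(plateau_schedule(losses))
```

Keras ships this behavior as the `ReduceLROnPlateau` callback, which is the likely mechanism behind the curve in Figure 10.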
The performance of the proposed classifier is presented in the form of confusion matrix ( Figure 11) and receiver operating characteristic (ROC) curves for each activity (see Figure 12).
The confusion matrix presents the numbers of cases classified to a specific group (predicted label) in comparison with their real classification (true label). The correctness of the classification is as follows:
• parking: ten out of 108 signals were classified incorrectly (two as driving through a roundabout, four as driving in city traffic and four as driving through an intersection);
• driving through a roundabout: two out of 96 signals were classified incorrectly (one as parking and one as driving in city traffic);
• driving in city traffic: three out of 120 signals were classified incorrectly (one as each of the other groups);
• driving through an intersection: one out of 92 signals was classified incorrectly (as driving in city traffic).
Based on the confusion matrix, the following parameters (adapted for multi-category classification) were calculated:
• precision: the fraction of signals assigned to a given class which truly belong to it;
• recall: the fraction of signals of a given class which were correctly recognized;
• F1-score: a combination of the two aforementioned metrics which rises when both precision and recall increase.
The calculated values of the aforementioned metrics are presented in Table 1. The highest precision was obtained for parking (0.98), the highest recall was obtained for driving through a roundabout and driving through an intersection (0.98 in both cases) and the highest F1-score for driving through a roundabout. The lowest precision was obtained for city traffic, and the lowest recall and F1-score were obtained for parking.
The ROC curve measures the capability of the analyzed model to distinguish the given classes and plots the true positive rate against the false positive rate. In a multi-class problem, we measure this ratio for each single category against all the other categories [37]. The receiver operating characteristic curves show that the created model was well trained, without overfitting or underfitting. They also show that there are no unbalanced ratios between true positives and false positives, which again leads to the conclusion that the model was not biased.
Based on the confusion matrix and ROC curves, we can observe that most of the misclassified samples belong to the parking activity (see Figure 11). The reason is that parking can be performed in different ways (angled, perpendicular, parallel) and is also associated with frequent movements of the head and eyeballs reflected in the EOG signal. As a result, the EOG signals linked to parking may resemble those of the other activities, such as driving through a roundabout.

Discussion
We built a classifier of drivers' activity based on a 1D CNN. Its accuracy was 99.8% on the training set and 95.6% on the validation set, which is higher than the accuracy reported in comparable studies. The classifier has proven its high accuracy in classifying four driving scenarios (parking, driving through a roundabout, driving in city traffic and driving through an intersection). Its inherent ability to capture non-linear patterns in sensor data makes the 1D CNN a powerful tool for processing biological signals. The main drawback of this approach is that it requires a fixed-size input. The performance of the 1D CNN deep learning model is at least 14% better than the BFS approach with soft assignment to specific configurations on the same data set. The results obtained for both sets (training and validation) emphasize two points: the superiority of automatically learned features over the manually crafted ones used in [17], and the stability of 1D CNN deep learning architectures.
The type of dominant features fed to the classifier in [17] depends on the size of the sliding window and the BFS entropy of the extracted data frames. Moreover, the 1D CNN model is among the most efficient methods, which underlines its stability and suggests a good ability to generalize to various data sets [41], including medical data [33]. Previous studies on drivers' behavior based on monitoring one or two signals reported accuracies ranging from 60% to 80% [42].
In this study, we have proven the feasibility of drivers' activity recognition based solely on EOG data, regardless of the driver's experience and style of driving, which can otherwise be determined from accelerometer and gyroscope data. Therefore, this approach may be applied in numerous real-world scenarios, such as building a system that may help improve driving skills and driving safety, especially in smart vehicles.
Due to the increasing number of vehicles on roads, changing the paradigm of the driver training process is necessary to prevent the growth of road accidents and fatalities. The opportunity to measure the drivers' perception can provide valuable insight into drivers' attention. Accurate and inexpensive driver assistant systems may help encourage safe driving. However, real-time monitoring of behavior and driving conditions imposes technical challenges and the need for monitoring the state of the driver, especially dizziness caused by long trips, extreme changes in lighting, reflections of the glasses or the weather conditions on the road.
In the future, we will address the limitations of our study: the need to provide fixed-size input signals and the misclassification of some recordings linked to the parking activity. We propose developing a variable-size model which will use global pooling instead of a flatten layer to overcome the first limitation.
With that approach, we may also propose a new method of determining the length of the input vector based on information from additional sensors. To overcome the second limitation, we propose differentiating the variants of parking and providing a more robust architecture. Although we achieved satisfactory results with the 1D CNN, in the long term we will consider ensemble classifiers or multi-input deep learning models for recognizing a wider range of activities.
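The appeal of a global-pooling head can be sketched in a few lines of NumPy (shapes are illustrative): averaging each feature map over time produces an output whose size no longer depends on the input length, so recordings would not need to be tiled to a common length.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_average_pool(feature_maps: np.ndarray) -> np.ndarray:
    """Average each feature map over the time axis: (T, C) -> (C,)."""
    return feature_maps.mean(axis=0)

# Feature maps from recordings of very different lengths (32 channels
# each) collapse to the same fixed-size vector for the dense classifier.
short_rec = global_average_pool(rng.normal(size=(23, 32)))
long_rec = global_average_pool(rng.normal(size=(517, 32)))
print(short_rec.shape, long_rec.shape)  # (32,) (32,)
```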

Acknowledgments:
We would like to thank the volunteers who participated in the study and the driving school for providing the opportunity to acquire signals on learner drivers. We also would like to thank the reviewers for useful comments.

Conflicts of Interest:
The authors declare no conflict of interest.

Ethical Statements:
The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Bioethics Committee of the Medical University of Silesia on 16 October 2018 (KNW/0022/KB1/18).

Abbreviations
The following abbreviations are used in this manuscript:

1D	One-dimensional