1. Introduction
Understanding human behavior is at the epicenter of modern AI research. Modeling and monitoring a user’s state is critical to designing adaptive and personalized interactions and has led to groundbreaking changes in several domains during the last few years. The transportation sector, in particular, is one of the application areas that has invested the most in “smart” monitoring, with the broader goal of increasing safety and improving the quality of the overall experience [1]. That is especially true for the automotive industry: for many years the number of road accidents has been steadily increasing, and car manufacturers have shifted their attention toward machine learning and AI-powered solutions.
According to the World Health Organization (WHO), each year 1.35 million people lose their lives to road accidents while 50 million are injured [2,3]. That translates to approximately 3700 deaths and 137,000 injuries daily. Moreover, according to the same source, road traffic injuries are the leading cause of death for children and young adults between the ages of 5 and 29 years. In particular, young males are three times more likely to be involved in a car accident than young females, with mobile phone usage being the most common cause of distraction. What is especially surprising, according to WHO findings, is that hands-free phone usage remains almost as dangerous as physically interacting with the device. It is estimated that road crashes cost most countries an average of 3% of their gross domestic product, while projections show that by 2030 road accident fatalities will be the fifth most common cause of mortality globally, up from ninth in 2011.
Specifically in the US, the National Highway Traffic Safety Administration (NHTSA) reports that in 2018 alone, 2800 lives were lost and more than 400,000 people were injured due to distracted driving. Additionally, in 2017 alone, 91,000 police-reported crashes involved drowsy drivers, leading to an estimated 50,000 people injured and nearly 800 deaths. However, as the NHTSA suggests, there is broad agreement across the traffic safety, sleep science and public health communities that these numbers underestimate the real impact of driving while mentally or physically fatigued, an underestimate that occurs due to the lack of technology and tools to detect and account for drowsy driving behaviors [4,5].
In this work we address the problem of driver state modeling with respect to both distraction and alertness. The originality of our work stems from two main standpoints. First, this is one of the very few efforts to tackle both conditions in parallel and study how they intersect. Second, in this study we focus explicitly on four different types of physiological markers: blood volume pulse, skin conductance, skin temperature and respiration. That is in contrast to the vast majority of driver monitoring systems, which exploit either visual information such as facial and motion analytics [6,7,8] or vehicular data such as speed, steering patterns, etc. [9,10,11]. The largest portion of studies that use physiological signals for driver behavior modeling focuses on detecting and measuring stress [12,13,14], a condition that may have a latent relation with both distraction and drowsiness but is by no means identical to either of them. This holds in general, but also specifically in the context of driving, as confirmed by Desmond et al. [15].
Through our experimental analysis, we try to answer three main questions, which also summarize the scientific contribution of this work:
Which physiological indicators are most indicative of drowsy and distracted behavior?
Are there specific statistical features coming from different signals that are particularly informative?
Is it possible to jointly tackle the problems of drowsiness and distraction detection and, if so, how can such a framework be formulated?
For our experiments we use a novel dataset, compiled by our team, that consists of 45 subjects participating in a driver-simulation setup. The dataset captures varying levels of attention and alertness, both across and within participants. Additionally, participants are exposed to different types of common driving distractions, with a special focus on variants of cognitive distraction, which are much harder to detect using the more popular computer vision-based approaches.
The rest of the paper is structured as follows: in the next section, we discuss how related research has tried to address the main questions targeted by this paper. Section 3 presents the steps followed during the experimental methodology with respect to data collection, data processing and performance evaluation. In Section 4, we present in greater depth the different classification approaches proposed by this paper. Section 5 contains the results achieved by each technique and discusses how different features and modeling methods affect performance in each targeted scenario. Finally, we conclude by summarizing the outcomes of our research and, guided by our experimental insights, we suggest future research directions.
3. Dataset and Experimental Setup
We compiled a novel multimodal dataset consisting of RGB (red, green, blue), infrared, thermal, audio and physiological information. The dataset was collected in a simulated environment, with multimodal data gathered from 45 subjects. All study procedures were reviewed and approved by the University of Michigan’s Institutional Review Board (IRB) under the identification code HUM00132603 on 31 October 2018. In total, the dataset consists of 30 male and 15 female participants, all between the ages of 20 and 33 years. For the purposes of this publication we focus exclusively on the four physiological indicators.
Figure 1 illustrates the experimental setup environment.
3.1. Experimental Procedure
We held two recordings for each participant. One recording took place in the morning, usually between 8 a.m. and 11 a.m., and the second recording happened during the afternoon/evening, between 4 p.m. and 8 p.m. We asked all participants to schedule the morning recording as the first task in their daily routine so that they would be as alert as possible. Conversely, participants were expected to attend the afternoon recordings later in the day, usually before going home, and were specifically instructed not to nap throughout that day until the time of the recording. Our assumption is that at different times of the day we could capture varying levels of alertness and biological rhythms, and that during late afternoon recordings subjects would tend to be drowsier. This assumption is based on several past research findings suggesting that drowsy behaviors are mostly observed either late at night or during the late afternoon, and that these are also the time slots in which most drowsiness-related driving accidents occur [5,35,36,37,38]. That is especially true for our specific target group (young adults), the vast majority of whom were graduate and undergraduate students and participated in the afternoon recording after attending long hours of classes. Even though our analysis is representative of this age group, and taking into account that age is a relevant factor in the degree to which drowsiness affects driving, we cannot safely generalize our findings to older adults at this point. The two recordings did not have to happen on the same day or in any specific order. Each recording lasted 45 min on average and consisted of three different sub-recordings: ‘baseline’, ‘free-driving’ and ‘distractions’. During each session, for both the distractions and free-driving sections, the drivers were free to drive anywhere in the virtual environment, which consisted of both city-like environments and highways with low traffic, no pedestrians, good weather and daylight conditions.
The ‘baseline’ recording consisted of two sub-parts: the ‘base’ part and the ‘eye-tracking’ part. In the ‘base’ part, participants were asked to sit still, breathe naturally and stare at the middle of the central monitor for 2.5 min. For the ‘eye-tracking’ part, subjects were shown a pre-recorded video with a target changing its position every few seconds. Participants were asked to follow the target with their gaze while acting naturally. This part lasted another 2.5 min.
During the ‘free-driving’ recording, participants had to drive uninterrupted for approximately 15 min. Before the beginning of each ‘free-driving’ recording, and after explaining the basic operation controls, we gave participants a chance to drive for a few minutes so that they could familiarize themselves with the simulator. To minimize the biases introduced by the relatively unfamiliar virtual-driving setup, for the purposes of this paper we used only 5 min long data segments, extracted from the last 7 min of the free-driving recording, when subjects were already accustomed to the driving simulator.
The last part was the ‘distractions’ recording. This recording consisted of four different sub-parts that simulated different types of common driving distractors. Below we describe the four distractors that participants were exposed to during each recording session.
Texting—Physical. Participants were asked to type a short text message on their personal mobile device. The text was a predefined 8-word message and was dictated to the participant by the experiment supervisor on the fly. By using predefined texts we aimed to minimize the cognitive effort that subjects had to put into texting and focus more on the physical disengagement from driving. Nonetheless, texting combines all three distraction classes defined by the NHTSA and the CDC: Manual, Visual and Cognitive. The mobile device was placed on an adjustable holder on the right side of the steering wheel, and participants had the freedom to adjust the positioning of the holder at will to fit their personal preferences, thus simulating a real-car setup as accurately as possible.
N-Back Test—Cognitive Neutral. The second distractor was the N-Back test. This distractor aimed to challenge exclusively the cognitive capabilities of the subjects while driving. N-Back is a cognitive task extensively applied in psychology and cognitive neuroscience, designed to measure working memory [39]. For this distractor, participants were presented with a sequence of letters and were asked to indicate when the current letter matched the one from n steps earlier in the sequence. For our experiments we set N = 1 and deployed an auditory version of the task, where subjects had to listen to a prerecorded sequence of 50 letters. A minimal sketch of the 1-back target rule is shown below.
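The following Python snippet is purely illustrative: it generates a 50-letter sequence and marks the trials a participant should flag under the 1-back rule. The generator and seed are assumptions for demonstration, not the actual prerecorded stimulus used in the study.

```python
import random
import string

def make_sequence(n_letters=50, seed=0):
    """Generate a letter sequence (50 letters, matching the stimulus length)."""
    rng = random.Random(seed)
    return [rng.choice(string.ascii_uppercase) for _ in range(n_letters)]

def n_back_targets(seq, n=1):
    """A trial is a target when the current letter equals the one n steps back."""
    return [i for i in range(n, len(seq)) if seq[i] == seq[i - n]]

seq = make_sequence()
print(n_back_targets(seq, n=1))  # indices the participant should flag
```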
Listening to the Radio—Cognitive Emotional. For this distractor, participants were asked to listen to a pre-recorded news audio clip and then comment on what they had just heard by expressing their personal thoughts. As with the N-Back test, this distractor challenges mainly the cognitive capabilities of the participant while driving, but with one major difference: in contrast to the neutral nature of the previous distractor, here the recordings were emotionally provocative, hence motivating an affective response from the subject. In particular, the two recordings used as stimuli for this part were related to (a) a potential active shooter event that took place in the greater Detroit area and (b) reporting from a fatal road accident scene in the Chicago area. These choices were made to help the participants relate better to the events described in the recordings.
GPS Interaction—Cognitive Frustration. In this step, we asked participants to find a specific destination on a ‘GPS’ through verbal interaction. The goal of this distractor was to induce confusion and frustration in the participant, emotions that people are likely to experience when driving, either by interacting with similar ‘smart’ systems or through engagement with other passengers or drivers on the road. In this case, the ‘GPS’ was operated by a member of the research staff in the background, who provided misleading answers to the participant and repeated mostly useless information until the desired answer was provided.
Once the participants started driving they did not stop until the end of the recording. Thus, they did not experience any interruptions when switching from the ‘free-driving’ to the ‘distractions’ parts. For each of the distractors we had two similar alternatives, which we randomly switched between morning and afternoon recordings, making sure that each subject would be exposed to a different stimulus each time they participated.
3.2. Modality Description
During each recording, the following four physiological signals were captured using hardware provided by Thought Technology Ltd. and the BioGraph Infiniti software:
Blood volume pulse (BVP): BVP is an estimate of heart rate based on the volume of blood that passes through the tissues in a localized area with each beat (pulse) of the heart. The BVP sensor shines infrared light through the finger and measures the amount of light reflected by the skin. The amount of reflected light varies during each heartbeat as more or less blood rushes through the capillaries. The sensor converts the reflected light into an electrical signal that is then sent to the computer for processing. BVP has been extensively used as an indicator of psychological arousal and is widely used as a method of measuring heart rate [40,41]. The BVP sensor was placed on the index finger. BVP is collected at a rate of 2048 Hz.
Skin conductance: Skin conductance is collected by applying a low, undetectable and constant voltage to the skin and then measuring how the skin conductance varies. Similar to BVP, skin conductance variations are known to be associated with emotional arousal and changes in the signals produced by the sympathetic nervous system [41,42]. The sensor for these measurements was placed on the middle and ring fingers. The skin conductance signal is captured at 256 Hz.
Skin temperature: This sensor measures temperature on the skin’s surface and captures temperatures between 10 °C and 45 °C (50 °F–115 °F). The temperature sensor was placed on the pinky finger. Skin temperature is also captured at 256 Hz.
Respiration: The respiration sensor detects breathing by monitoring the expansion and contraction of the rib cage during inhalation and exhalation. By processing the captured periodic signal, important characteristics such as respiration period, rate and amplitude can be computed. The respiration stripe was wrapped around the participant’s abdomen and the sensor was placed at the center of the body. Respiration is captured at 256 Hz.
All sensors can be seen on the top right of Figure 1. Skin conductance, respiration and skin temperature values are padded to match the 2048 Hz sampling rate used for BVP. The total amount of data in terms of time across the different recording segments is shown in Table 1. For each segment, approximately half of the data come from the morning recordings and half from the afternoon.
3.3. Feature Extraction
Statistical features are extracted from the four raw signals in both the time and frequency domains. Feature values are padded to match the maximum available sampling rate of 2048 Hz. In total, 73 statistical features are computed over the four raw physiological measurements: 49 features related to the BVP signal and 24 features coming from the remaining three modalities.
BVP features: Time domain statistical features such as mean, minimum, maximum and standard deviation are computed, describing both the overall behavior of the signal and the relation between consecutive inter-beat intervals (IBIs). NN-related features describe the interval between two normal heartbeats. pNN features refer to the number of pairs of consecutive normalized IBI values that differ by more than 50 ms [43]. Additional features are computed to describe the spectral power statistics of different frequency bands by grouping the frequencies into three bands: very low (<0.04 Hz), low (0.04–0.15 Hz) and high (0.15–0.4 Hz). For each frequency band, power-related statistics are calculated (a spectral sketch follows this list).
Respiration features: Amplitude, period and respiration rate are calculated along with the standard statistics from the raw respiration signal.
Respiration + BVP features: Four features are computed that combine BVP and respiration measurements to describe the peak-to-trough difference in heart rate that occurs during a full breath cycle (HR Max-Min features, as seen in Figure A1 in Appendix A).
Skin conductance and skin temperature features: Six features are extracted from each signal, describing standard temporal statistics over short- and long-term windows on top of the raw measurements. Features include the measurement as a percentage of change, the long- and short-term window means, the standard deviation of the short-term window, the direction/gradient of the signal and the measurement as a percentage of the mean in the short-term window.
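As an illustration of the frequency-band features named above, the following sketch estimates the spectral power of a signal in the three bands using Welch’s method. The estimator and window settings are assumptions for demonstration; the exact implementation belongs to the BioGraph software.

```python
import numpy as np
from scipy.signal import welch

def band_powers(signal, fs):
    """Spectral power in the very-low, low and high frequency bands."""
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 1024))
    bands = {"vlf": (0.0, 0.04), "lf": (0.04, 0.15), "hf": (0.15, 0.4)}
    powers = {}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])  # integrate the PSD
    return powers

# Toy usage: a synthetic 0.1 Hz oscillation sampled for 5 min at 4 Hz.
t = np.arange(0, 300, 0.25)
x = np.sin(2 * np.pi * 0.1 * t) + 0.1 * np.random.randn(len(t))
print(band_powers(x, fs=4))
```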
Feature estimation and hyperparameter tuning (i.e., window strides and sizes) were handled automatically by the BioGraph Infiniti software. A sketch of such windowed statistics follows.
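For illustration, the snippet below computes short/long-window statistics of the kind listed for skin conductance and temperature. The window lengths are placeholders, since the actual strides and sizes were selected automatically by the software.

```python
import numpy as np
import pandas as pd

def window_features(x, short=32, long=256):
    """Short/long-window statistics over a raw signal (window sizes are placeholders)."""
    s = pd.Series(x)
    short_mean = s.rolling(short).mean()
    return pd.DataFrame({
        "pct_change": 100 * s.pct_change(),        # measurement as % of change
        "long_mean": s.rolling(long).mean(),       # long-term window mean
        "short_mean": short_mean,                  # short-term window mean
        "short_std": s.rolling(short).std(),       # short-term window std
        "gradient": np.gradient(s),                # direction of the signal
        "pct_of_short_mean": 100 * s / short_mean, # % of short-term mean
    })

features = window_features(np.cumsum(np.random.randn(1024)))  # toy conductance trace
print(features.tail())
```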
3.4. Feature Selection
To get a better understanding of how important the different features are and to reduce the high-dimensional feature space, we train two Decision Tree (DT) models on the tasks of drowsiness detection and distraction detection, respectively, and we evaluate the overall feature contribution in terms of information gain.
More specifically, we train each model on all 73 features plus the four raw signals, and we compute the increase in information gain caused by each feature after every split, for both tasks. Final scores are assigned by averaging the scores for each feature over the two tasks. Equations (1)–(3) describe the mathematical formulation of our analysis with respect to information entropy and gain. We use Python’s scikit-learn library for this purpose; a minimal sketch is given after the equations.
Figure A1 in Appendix A illustrates all 73 features and their final importance scores. The top five performing features are listed and described in Table 2.
E(x) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \quad (1)

where E(x) is the entropy of feature x, x_i is a specific feature value, P(x_i) is the probability of x_i and n is the total number of possible values that variable x can take.

IG_t(x) = E(y) - E(y|x) \quad (2)

where IG_t(x) is the information gain with respect to feature x for task t, E(y) is the entropy of the dependent variable y and E(y|x) is the entropy of y given feature x. E(y|x) is calculated as shown in Equation (1), but the probabilities for the values of variable y are calculated under the condition of feature x.

IG(x) = \frac{1}{2} \sum_{t=1}^{2} IG_t(x) \quad (3)

where IG(x) is the information gain with respect to feature x across both tasks and t is the task id.
3.5. Metrics and Evaluation
We evaluate the different models using the four evaluation metrics described below:
Sensitivity: Sensitivity (or positive recall) is estimated as the proportion of positive samples that are classified correctly. In the context of this paper, sensitivity describes the percentage of drowsy or distracted samples that are correctly identified. The formula to compute sensitivity in terms of true positives (TP) and false negatives (FN) is: Sensitivity = TP/(TP + FN).
Specificity: Specificity (or negative recall) is estimated as the proportion of negative samples that are classified correctly. In the context of this paper, specificity describes the percentage of alert or not-distracted samples that are correctly identified. The formula to compute specificity in terms of true negatives (TN) and false positives (FP) is: Specificity = TN/(TN + FP).
Average recall: Average recall corresponds to the mean of specificity and sensitivity. The higher the average recall, the less severe the trade-off between sensitivity and specificity.
Receiver operating characteristic (ROC) curve: The ROC curve is a graphical way to visualize the classification ability of a binary classifier. ROC curves describe the relation between the TP-rate and FP-rate at different thresholds, where the FP-rate is given as 1 − specificity. The area under the ROC curve, also known as AUC, is equal to the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative one, and is a measure of the general ability of the model to discriminate between the two classes. The higher the AUC, the better the model.
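All four metrics can be computed directly from model outputs. A minimal, self-contained sketch with toy labels and scores (illustrative values, not data from our experiments) is given below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # toy ground truth
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])   # toy model scores
y_pred = (y_score >= 0.5).astype(int)                          # threshold at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # positive recall: TP / (TP + FN)
specificity = tn / (tn + fp)            # negative recall: TN / (TN + FP)
average_recall = (sensitivity + specificity) / 2
auc = roc_auc_score(y_true, y_score)    # area under the ROC curve
print(sensitivity, specificity, average_recall, auc)
```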
3.6. Normalization and Classification Setup
To reduce the computational demands of the problem, we sub-sample all available information streams to 8 Hz. Then, the data of each participant are normalized based on their afternoon baseline recording (see Section 3.1). We choose the afternoon baseline over the morning one, as it led to slightly better overall performance during experimentation. The normalization formula is shown in Equation (4).
\hat{x}_j = \frac{x_j - \mu_j^{base}}{\sigma_j^{base}} \quad (4)

where \hat{x}_j is the normalized value of feature x for participant j, and \mu_j^{base} and \sigma_j^{base} are the mean and standard deviation of feature x computed over participant j’s afternoon baseline recording.
Finally, consecutive samples are grouped into batches of 64 by using an eight-second, non-overlapping windowing approach. As a result, all of our models provide one prediction every 8 s. For all classification experiments, we apply a 10-fold cross-validation scheme, using at each fold 20% of the users for testing and the rest of the users for training.
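A compact sketch of this preprocessing, assuming the z-score form of Equation (4) and the 8 Hz / 8 s windowing described above (array names are illustrative):

```python
import numpy as np

def normalize_to_baseline(x, baseline):
    """Z-score a participant's feature stream against their baseline (Eq. (4))."""
    return (x - baseline.mean()) / (baseline.std() + 1e-8)  # epsilon guards /0

def make_windows(x, rate_hz=8, win_sec=8):
    """Group consecutive samples into non-overlapping 8 s windows of 64 samples."""
    win = rate_hz * win_sec
    n = len(x) // win
    return x[: n * win].reshape(n, win)  # one model prediction per row

# Toy usage: a 5 min stream at 8 Hz normalized against a 2.5 min baseline.
stream = np.random.randn(5 * 60 * 8)
baseline = np.random.randn(int(2.5 * 60 * 8))
windows = make_windows(normalize_to_baseline(stream, baseline))
print(windows.shape)  # (37, 64)
```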
5. Results
We perform three types of experiments. Initially, we present our results on drowsiness detection by measuring performance on different feature sets using the CNN-LSTM pipeline. In addition, we compare the best performing CNN-LSTM model against the three traditional ML classifiers (Section 4.1). Then, we apply the same evaluation to the distraction detection task. Finally, we explore how the different joint modeling approaches (Section 4.2) perform in detecting driver drowsiness and distraction in parallel, and we compare their performance against the more traditional modeling alternatives.
5.1. Single-Task Learning
First, we perform a general evaluation across different feature combinations using the deep CNN-LSTM pipeline. The results for drowsiness and distraction detection are presented in Figures 4 and 5, respectively, in terms of ROC curves and AUC scores. Then, for the best feature set on each task, we evaluate all classifiers in terms of sensitivity, specificity and average recall, and we discuss the contribution of different features and models to identifying the two conditions.
In particular, the following feature combinations are presented:
BVP: Raw BVP data plus 49 temporal features extracted from the BVP signal.
Respiration: Raw respiration data plus eight temporal features extracted from the respiration signal.
Skin conductance: Raw skin conductance data plus six temporal features extracted from the skin conductance signal.
Temperature: Raw skin temperature data plus six temporal features extracted from the skin temperature signal.
All raw data and modality-based features: The input data consist of the concatenation of all features and raw signals mentioned above.
All raw data: Only the raw data from the four physiological sources are concatenated and used as input features.
BVP + respiration (BVP-R): We evaluated different combinations of raw data as input features. Out of all the possible mixtures, combining BVP and respiration data stood out as the most efficient combination for part of our experiments.
Top #5 BVP features: The input data consist of the top #5 performing features (Table 2), as identified through the analysis discussed in Section 3.4.
Top #3 IBI BVP features: The input data consist of the top #3 BVP features that are related to IBI (features #1, #2 and #5 of Table 2).
5.2. Drowsy Driver Modeling
For this experiment we split the data into the following two classes: recordings made during the morning session (8 a.m. to 11 a.m.) are labeled as ‘alert’, while recordings made during the afternoon session (4 p.m. to 8 p.m.) are marked as ‘drowsy’. We perform a 10-fold cross-validation across the participants, using 20% of the users for testing and the remaining 80% for training.
As shown in Figure 4, of all the evaluated feature sets, the combination of “BVP and respiration” signals is by far the most efficient approach, achieving an AUC of 88%. These results are partially in line with the findings presented in Section 3.4, which identified that specific BVP statistics are highly related to both tasks. On the other hand, in contrast to the features identified through the analysis of Section 3.4, we observe that respiration-related data are also highly associated with drowsiness. In particular, the “top #3 IBI BVP features” set along with all the “respiration-related data” are responsible for the second and third best AUC scores, with 75% and 74%, respectively. However, when we combine all BVP features, the performance drops significantly to 61%. We believe that the significant increase in feature space, along with the decrease in available data after sub-sampling the signals to 8 Hz, is partially responsible for that observation, since the network parameters do not have access to the amount of information required to train properly. In addition, it is quite possible that several BVP features are not actually good descriptors of drowsiness, thus adding noise to the input instead of contributing to the final decision. On the other hand, it seems that simply joining the two raw information streams of BVP and respiration is sufficient for the network to capture the important characteristics of the signals. We believe that this could relate to the fact that both signals have a periodic behavior, and we know that characteristics related to IBIs for the BVP signal, and to rate and amplitude for the respiration signal, are of special importance to the task. This is confirmed by the high performance observed when explicitly using the top #3 IBI BVP features or the respiration-related data, respectively. The rest of the evaluated feature combinations do not offer any significant value on this task, showing performance comparable to random guessing.
Focusing on the results of Table 3, we see that the CNN-LSTM model performs significantly better than all the baseline classifiers when trained only on the raw BVP and respiration data. The very poor performance of the baseline classifiers is indicative of how challenging the targeted problem is, while at the same time it highlights the superiority of the deep spatio-temporal classifier over the traditional and more popular alternatives on the task of physiological drowsy driver behavior modeling. In particular, all baseline classifiers perform very poorly in terms of sensitivity while achieving very high specificity scores. In other words, these models largely fail to identify drowsy behavior, but they are very unlikely to label someone who is actually alert as drowsy. On the contrary, the CNN-LSTM model outperforms all baselines by far in terms of sensitivity, with a score of 93%, while it provides worse but reasonable results in terms of specificity, with a score of 71%. This means that the chance of correctly identifying drowsy behavior with this model is quite high, even though in approximately three out of ten cases an alert driver will be wrongly identified as drowsy.
5.3. Distracted Driver Modeling
For this experiment we split the data as follows: data corresponding to any of the distraction segments are labeled as ‘distracted’, while data collected during the free-driving part are labeled as ‘not-distracted’. To minimize the biases introduced by the relatively unfamiliar virtual-driving setup, we use five minute long data segments, extracted from the last seven minutes of the free-driving recording, when subjects were already accustomed to the driving simulator.
Similarly to the previous section, we perform a feature-based analysis using the deep CNN-LSTM model. Observing the ROC curves of Figure 5, it is safe to assume that identifying distracted behavior based on the selected feature sets is relatively more challenging than detecting drowsiness. According to the AUC scores, all BVP-related feature combinations provide by far the best results, indicating the strong relation of heart-rate-related features to the task. More specifically, the best results were achieved by the “top #5 BVP” features of Table 2, with an AUC of 82%, followed by the “top #3 IBI BVP” feature set with an AUC of 80% and the set with “all BVP”-related data with an AUC of 79%. Of special interest is the very poor performance observed for the combination of raw BVP and respiration data, which provided the best results in the problem of drowsiness detection. Even though it is hard to clearly explain the very low performance of this feature set, we suspect that the overall poor performance of the respiration data on the task affects the results in a negative way. Judging from the AUC score achieved by the “respiration” feature set, it seems that respiratory data are not as related to distracted behavior as they are to drowsy behavior. However, these negative results need to be evaluated further in the future.
Taking these findings into consideration, we evaluate the different classifiers on their ability to identify distracted driving based on the “top #5 BVP” feature set. The results are presented in Table 4. Again, the CNN-LSTM model significantly outperforms all three baselines. The model provides the most balanced results, with an average recall of 72%. The advantage of the CNN-LSTM model over its competitors lies less in its ability to identify distracted behavior and more in its balanced performance between sensitivity and specificity. In particular, the SVM model performs better at identifying non-distracted driving, while its predictive ability with respect to distraction detection is almost random, making this approach the least appropriate of all for the purposes of the task. On the other hand, the KNN and RF classifiers perform equally to the CNN-LSTM model at identifying distracted driving. However, their high FP-rate makes them less appropriate for modeling the problem, as they have almost a 50% chance of marking a not-distracted driver as distracted.
Overall, we can argue that the CNN-LSTM pipeline is by far the most effective at modeling both distracted and drowsy driver states under the same experimentation conditions, both in terms of correctly identifying the condition of interest (more TPs) and in terms of correctly rejecting its absence (more TNs).
5.4. Multitask Learning for Joint Driving Behavior Modeling
Dedicating a single machine learning model to each condition of interest has traditionally been the most popular and effective approach to problems related to human behavior modeling. However, in several cases we are interested in predicting conditions that coexist and may overlap. Our assumption is that overlapping conditions may share common ground in terms of the physiological reactions they cause in drivers. To that end, we evaluate different machine learning methods on their ability to jointly predict the driver’s state in terms of distraction and alertness. For these experiments we use the combination of the seven temporal features that performed best on the individual tasks. Hence, each training feature vector consists of the raw recordings from BVP and respiration plus the five BVP features identified through the analysis of Section 3.4. Figure 3 illustrates all the deep learning methods evaluated. For the cases of SVM, RF and KNN, we formulate the problem similarly to the deep Scheme D model, i.e., as a 4-class classification task.
In order to have a fair comparison across the different approaches, we evaluate all models on their ability to correctly identify drowsiness and distraction as independent tasks. In the case of the multi-class models (Scheme D, SVM, KNN, RF) in particular, the results are evaluated as two binary classification problems and not as a traditional four-class task. Formulating the evaluation as such allows for a one-to-one comparison against the multitask methods (Schemes A, B and C) and avoids diluting the characterization ability of the different classifiers with respect to the individual conditions, while still learning shared parameters between tasks. A sketch of such a branched multitask architecture is shown below.
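For concreteness, the following PyTorch sketch shows a branched multitask model in the spirit of Schemes A and B: a shared CNN-LSTM trunk with one binary head per condition, trained on a joint loss. Layer sizes, and the input shape of 64 samples by 7 feature streams, are illustrative assumptions rather than the exact architecture of Figure 3.

```python
import torch
import torch.nn as nn

class BranchedMultitaskModel(nn.Module):
    """Shared CNN-LSTM trunk with one binary head per condition (illustrative sizes)."""
    def __init__(self, n_streams=7, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_streams, 32, kernel_size=3, padding=1),  # shared layers
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)        # shared layers
        self.head_drowsy = nn.Linear(hidden, 1)       # task-dedicated branch
        self.head_distracted = nn.Linear(hidden, 1)   # task-dedicated branch

    def forward(self, x):                  # x: (batch, 64 samples, n_streams)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(z)
        h = h[-1]                          # last LSTM hidden state
        return self.head_drowsy(h), self.head_distracted(h)

# The joint optimization error sums the two per-task binary cross-entropies,
# so the shared parameters are updated from both tasks at once.
model = BranchedMultitaskModel()
x = torch.randn(8, 64, 7)                 # one 8 s window per sample
t_drowsy = torch.randint(0, 2, (8, 1)).float()
t_distracted = torch.randint(0, 2, (8, 1)).float()
logit_dr, logit_di = model(x)
bce = nn.functional.binary_cross_entropy_with_logits
loss = bce(logit_dr, t_drowsy) + bce(logit_di, t_distracted)
loss.backward()
```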
Figure 6 shows the ROC curves of all the deep learning-based models for both tasks. Solid lines correspond to the drowsiness detection task, while dotted lines correspond to distraction detection. Schemes A and B offer the most balanced results across the two detection problems. In particular, the two models perform comparably well to the CNN-LSTM distraction model (Figure 5), while providing the second and third best results for the drowsiness task in terms of AUC when compared to the alternatives of Figure 4. However, despite the fact that the performance is acceptable for both tasks, it always remains inferior to the condition-targeted classifiers. This could partially be an effect of the limited amount of data available to properly tune the model parameters. Another possible explanation could be that, apart from the IBI BVP features that have proven value in characterizing both conditions, the rest of the input features are less robust across tasks, thus hindering the model’s ability to converge to a higher overall score.
Nonetheless, it is clear that Schemes A and B offer the best overall modeling performance among all the approaches that jointly learn representations for the two conditions. These architectures are the only ones that have dedicated layers for each task, while their parameters are updated based on an optimization error that takes both individual performances into account. Based on these results, we can see that the number of shared layers does not have a significant impact on the task-specific performance, even though this might change as the available training data increase. Scheme C, which has all layers shared across the two tasks, performs the worst in terms of AUC. This observation indicates, to some extent, that the physiological responses caused by the two conditions are not the same but do overlap to some degree, given that the model achieves a performance significantly higher than random for both conditions. Finally, Scheme D, which is the multi-class approach, slightly underperforms compared to the branched multitask learning approaches. This model also exploits layers shared across tasks to learn parameters of importance to both conditions. At the same time, the discrimination into four classes assists the model in learning how the different physiological responses relate to the presence or absence of each condition simultaneously. However, splitting the data into four classes limits the available data under each category, thus having a negative impact on the model’s performance.
Table 5 shows in more detail the performance of all the classifiers in terms of sensitivity, specificity and average recall. All deep learning-based methods perform significantly better than the traditional machine learning models in both the single-task and joint-task evaluations (see also Tables 3 and 4). At the same time, Schemes A and B, which are the two multitask learning approaches with a branching architecture, provide the best results in terms of average recall. That is due to the fact that the two models offer the best trade-off between sensitivity and specificity for the detection of both conditions. Of special interest is the high sensitivity achieved by Scheme C. This model offers higher scores than all joint-learning alternatives on the drowsiness detection task, while it outperforms by far all methods tested on the distraction detection task. That means that, in terms of detecting the conditions of interest, Scheme C is the most effective one. However, its poor performance in terms of specificity makes it inappropriate for a real-life scenario, since the high rate of false alarms would lead to an over-sensitive monitoring system.
6. Conclusions
In this paper, we explore different physiological markers and machine learning approaches with respect to their ability to describe distracted and drowsy driving. For our analysis, we compiled a dataset of 45 subjects and recorded their BVP, respiration, skin conductance and skin temperature responses while they participated in a simulated driving setup. Based on our analysis, the contribution of this publication can be summarized through the answers to the following three research questions:
With respect to drowsiness detection, BVP and respiration proved to be the two signals most associated with the task. In particular, the combination of the raw BVP and respiration measurements leads to the maximum drowsiness detection performance in terms of AUC, with a value of 88%, when processed through a spatio-temporal deep CNN-LSTM model. The second best performance, with a score of 75% AUC, is achieved by a subset of BVP-related features processed through the same modeling architecture, while respiration-related data and features lead to the third best performance, with a score of 74% AUC. Skin conductance and temperature signals and features lead to significantly inferior performance, with their AUC scores fluctuating around 50%.
With regard to distraction detection, BVP again proved to be highly associated with the task. All feature sets extracted from that signal clearly outperform all the alternative feature combinations when processed through the same spatio-temporal CNN-LSTM architecture, achieving AUC scores between 79% and 82%. The rest of the evaluated feature sets, which consist of various combinations of the remaining physiological markers and their statistical features, always perform around 60% AUC, showing some relation to distracted behavior but also highlighting their weakness at robustly capturing the condition when used exclusively.
More specifically, we train two DT classifiers targeted at the individual tasks, using all the available data, and we perform an entropy-based evaluation of all the available features with respect to their importance for detecting the two conditions. Figure A1 in Appendix A illustrates the importance of all features in terms of information gain after training the two models. Based on this analysis, we select the five most informative statistical features, presented in Table 2. As can be observed, all five features are related to BVP: three are extracted from the time domain and describe patterns related to BVP IBIs, and two are extracted from the spectral domain and relate to the spectral power of the signal in different frequency bands. BVP IBI-related statistics alone show great performance on both tasks, as they lead to the second best performance in both drowsiness detection, with 75% AUC, and distraction detection, with 80% AUC. Interestingly, when combined with the frequency-related features, the new feature set performs quite poorly on the drowsiness detection task, leading to almost random performance with 50% AUC, while it offers the best results on distraction detection with 82% AUC. It is not yet clear why adding the frequency features to the input feature set harms the classifier so abruptly with respect to drowsiness detection, and this is something that we would like to investigate further in the future.
In addition to the BVP features, respiration-related statistics showed a strong association with drowsy driving. In particular, combining the raw respiration data with features describing the temporal characteristics of the signal leads to a 74% AUC, which is the third highest score achieved by the single-task deep model for drowsiness detection. Specifically, the features extracted from respiration correspond to: respiration amplitude, respiration period, respiration rate, respiration rate epoch mean (where an epoch is 5 min of data), respiration rate mean (br/min) and respiration rate standard deviation (br/min).
Overall, our experiments showed that deep CNN-LSTM-based methods significantly outperform all the other evaluated traditional machine learning alternatives (RFs, KNN, SVM), which have the lion’s share in the evaluations presented in the related literature. In particular, the single-task CNN-LSTM model leads to a maximum performance of 88% AUC with 82% average recall for the drowsiness detection task, and to 82% AUC with 72% average recall for the distraction detection task. The second best performance across both tasks, however, is recorded by the joint condition learning multitask schemes with a branching architecture (Schemes A and B of Figure 3). Our evaluations highlight the potential of multitask learning for directly addressing such abstract conditions with overlapping physiological responses. Schemes A and B offer results directly comparable to the corresponding single-task CNN-LSTM model for the distraction detection task, with ~79.5% AUC and a slightly improved 73% average recall. At the same time, for the drowsiness detection task the models achieve ~76.5% AUC and 72.5% average recall. Even though performance is lower in terms of AUC and sensitivity compared to the single-task CNN-LSTM model in the case of drowsiness detection, the classifiers still perform notably better than all other evaluated methods on the task.
In general, we argue that building multitask learning models with dedicated layers for every targeted task is the method that showed the most promising results for joint condition learning. Avoiding branching and having only shared layers across tasks led to the worst results, as the model struggled to effectively distinguish between conditions, since the learned features could not scale equally across tasks. The multi-class approach also offered inferior results compared to multitask learning, as the division of the training data into multiple groups had a negative impact on the final result.
Even though condition-specific models still offer the optimal results, our findings strongly indicate that joint condition modeling using multitask learning has great future potential for this and similar tasks, and we plan to investigate this direction further in the near future. In addition, a limitation of the current analysis is that the levels of drowsiness experienced by the participants are not practically measurable. Our assumption about drowsiness is derived mostly from previous highly credible research (including NHTSA findings [5]) and the daily schedule of our specific target group (young adults who are graduate and undergraduate university students). In future versions of the dataset, we plan to introduce additional drowsiness evaluation methods, such as subjective sleepiness reporting [56] and objective test-based evaluations [57], to better quantify and measure the presence of drowsiness in our recordings.