On Assessing Driver Awareness of Situational Criticalities: Multi-modal Bio-Sensing and Vision-Based Analysis, Evaluations, and Insights

Automobiles for our roadways are increasingly using advanced driver assistance systems. The adoption of such new technologies requires us to develop novel perception systems not only for accurately understanding the situational context of these vehicles, but also for inferring the driver’s awareness in differentiating between safe and critical situations. This manuscript focuses on the specific problem of inferring driver awareness in the context of attention analysis and hazardous incident activity. Even after the development of wearable and compact multi-modal bio-sensing systems in recent years, their application in the driver awareness context has been scarcely explored. The capability of simultaneously recording different kinds of bio-sensing data in addition to traditionally employed computer vision systems provides exciting opportunities to explore the limitations of these sensor modalities. In this work, we explore the applications of three different bio-sensing modalities, namely electroencephalogram (EEG), photoplethysmogram (PPG) and galvanic skin response (GSR), along with a camera-based vision system in the driver awareness context. We assess the information from these sensors independently and together using both signal processing- and deep learning-based tools. We show that our methods outperform previously reported studies in classifying driver attention and detecting hazardous/non-hazardous situations for short time scales of two seconds. We use EEG and vision data for high-resolution temporal classification (two seconds) while additionally employing PPG and GSR over longer time periods. We evaluate our methods by collecting user data on twelve subjects for two real-world driving datasets, one of which is publicly available (KITTI dataset) while the other was collected by us (LISA dataset) with the vehicle being driven in an autonomous mode.
This work presents an exhaustive evaluation of multiple sensor modalities on two different datasets for attention monitoring and hazardous events classification.


I. INTRODUCTION
WITH the development of increasingly intelligent vehicles, it has become possible to assess the criticality of a situation well before the event actually happens. This makes it imperative to understand, from the driver's perspective, when the situation becomes critical. While computer vision continues to be the preferred sensing modality for achieving the goal of assessing driver awareness, the use of bio-sensing systems in this context has received wide attention in recent times [1], [2], [3]. Most of these studies have used electroencephalogram (EEG) as the preferred bio-sensing modality.
The emergence of wearable multi-modal bio-sensing systems [4], [5] has opened a new possibility to overcome the limitations of individual bio-sensing modalities through the fusion of features from multiple modalities. For example, the information related to the driver's drowsiness extracted from EEG (which suffers from low spatial resolution, especially when not using a very large number of sensors) may be augmented by the use of galvanic skin response (GSR), which does not suffer from electromagnetic noise (but has a low temporal resolution). Similarly, the fusion of information in EEG with computer vision systems currently in use has not been explored in the driving context.
Siddharth and Mohan M. Trivedi are with the Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, 92093, USA (e-mail: ssiddhar@eng.ucsd.edu, mtrivedi@eng.ucsd.edu).
Driver awareness depends highly on the driver's physiology since different people react differently to fatigue and to their surroundings. This means that a one-size-fits-all approach using computer vision based on eye blinks/closure etc. might not scale very well across drivers. It is here that the use of bio-sensing modalities (EEG, GSR, etc.) may play a useful role in assessing driver awareness by continuously monitoring the human physiology. The fusion of data from vision-based systems and bio-sensors might be able to generate more robust models for the same. Furthermore, EEG, with its higher temporal resolution than other common bio-sensors, may prove to be very useful for detecting hazardous vs. non-hazardous situations on short time scales (such as 1-2 seconds) if such situations do not register in the driver's facial expressions in such short time periods. Additionally, the driver's physiology can provide insights into how they react to various situations during the drive, which may have a correlation with the driver's safety. For example, heart-rate variability, which has been shown to model human stress [6], may be used to assess the driver's safety since s/he should not be allowed to drive when heavily stressed.
Deep learning has many applications in vision-based driver-monitoring systems [7], [8]. But these advances have not translated to the data from bio-sensing modalities. This is primarily due to the difficulty in collecting very large-scale bio-sensing data, a prerequisite for training deep neural networks. Collecting bio-sensing data on a large scale is costly, laborious and time-consuming. It requires sensor preparation and instrumentation on the subject before the data collection can be started, whereas for collecting images/videos even a smartphone's camera may suffice without the need to undergo any sensor preparation in most cases.
This study focuses on driver awareness and his/her perception of hazardous/non-hazardous situations from bio-sensing as well as vision-based perspectives. We individually use features from three bio-sensing modalities, namely EEG, PPG and GSR, and vision data to compare the performance of these modalities. We also use the fusion of features to understand if and in what circumstances it can be advantageous. To this end, we present a novel feature extraction and classification pipeline that is capable of working in real time. The pipeline utilizes pre-trained deep neural networks even in the absence of very large-scale bio-sensing data. To the best of our knowledge, this study is the most comprehensive view of using such widely varying sensing modalities towards assessing driver awareness and attention. Finally, we would like to emphasize that the bio-sensors used in this study are very practical to use in the "real world", i.e., they are compact in design, wireless and comfortable to use for prolonged time intervals. This choice of bio-sensors was made consciously so as to bridge the gap between laboratory-controlled experiments and "real-world" driving scenarios.
The framework presented in this paper is not just a user study but a complete scalable framework for signal acquisition, feature extraction, and classification that has been designed with the intent to work in real-world driving scenarios. The framework is capable of working in real time and is modular since we extract the information from each sensor modality separately. Finally, we test two hypotheses in this paper. First, we test if the modalities with low temporal resolution (but easily wearable), namely PPG and GSR, can work as well as the EEG and vision modalities for assessing the driver's attention. Second, we test if (and when) the fusion of features from different sensor modalities boosts the classification performance over using each modality independently for attention and hazardous/non-hazardous event classification. In the process of studying these two hypotheses, we extract traditional as well as deep learning-based features from each modality.

II. RELATED STUDIES
Driver monitoring for assessing attention, awareness, behavior prediction, etc. has usually been done using vision as the preferred modality [10], [11], [12]. This is carried out by monitoring the subject's facial expressions and eye-gaze [13], which are used to train machine learning models. But almost all such studies utilizing "real-world" driving scenarios have been conducted during daylight when ample ambient light is present. Even if infra-red cameras are used to conduct such experiments at night, the vision modality suffers from occlusion and widely varying changes in illumination [10], both of which are not uncommon in such scenarios. Furthermore, it has been shown that the use of EEG can classify hazardous vs. non-hazardous situations over short time periods, which is not possible with images/videos [14].
On the other hand, if we focus on the bio-sensing hardware, more than a decade ago, studies in driving scenarios that used bio-sensing modalities suffered from impracticality in "real-world" situations. This is because the bio-sensors were usually bulky, required wet electrodes, and were very prone to noise in the environment. Hence, the studies carried out with such sensors required wet electrode application and monitors in front of participants with minimal motion [15], [16]. In the early years of this decade, such bio-sensing systems gave way to more compact ones capable of transmitting data wirelessly while being more resistant to noise through better EM shielding and advances in mechanical design. Finally, recent advances have led to the development of multi-modal bio-sensing systems and the ability to design algorithms utilizing the fusion of features from various modalities. This has been utilized for various applications such as in affective computing and virtual reality [17], [18].
The use of deep learning for various applications relating to driver safety and autonomous driving systems has skyrocketed in the past few years. These studies have ranged from understanding driving behavior [19] to autonomous driving systems on highways [20] to detecting obstacles for cars [21], among other applications. All such studies use only the vision modality owing, as pointed out before, to the prevalence of large-scale image datasets. However, the use of "pre-trained" neural networks for various applications [22], [23] may provide a new opportunity. Hence, if bio-sensing data can be represented in the form of an image, it should be possible to use such networks to extract a deep learning-based optimal feature representation of the image (henceforth called most significant features) even in the absence of large-scale bio-sensing datasets. Finally, our system pipeline uses multiple bio-sensing modalities in addition to vision, which is not the case with previous state-of-the-art evaluations done on the KITTI dataset [24]. These research studies [14], [25] use a single modality and traditional features (i.e., not based on deep neural networks) for classification. Through our evaluation, we show that we outperform their results with higher-order features, and we also evaluate our pipeline on a new driving dataset generated by us.

III. RESEARCH METHODS
In this section, we discuss the various research methods that we employed to pre-process the data and extract features from each of the modalities used in this study.

A. EEG-based Feature Extraction
The cognitive processes pertaining to attention and mental load, such as while driving, are not associated with only one part of the brain. Hence, our goal was to map the interaction between various regions of the brain to extract relevant features related to attention. The EEG was initially recorded from a 14-channel Emotiv EEG headset at a 128 Hz sampling rate. We used the artifact subspace reconstruction (ASR) pipeline in the EEGLAB [27] toolbox to remove artifacts related to eye blinks, muscle movements, line noise, etc. [28]. This pipeline is capable of working in real time and, unlike ICA [9], has the added advantage of being able to remove noise without much loss of EEG data when a very large number of EEG sensors are not present. For each subject, we verified the output from ASR manually as well as by observing the algorithm's output parameters to make sure that the noise removal was being performed correctly. Then, we band-pass filtered the EEG data between 4-45 Hz. On this processed EEG data, we employed two distinct and novel methods to extract EEG features that capture the interplay between various brain regions to map human cognition.

1) Features based on Mutual Information:
To construct the feature space that can map the interaction of EEG information between various regions of the brain, we calculated the mutual information between signals from different parts of the brain. EEG-based mutual information features were used since they measure the changes in EEG across the various regions of the brain, as opposed to power spectrum-based features, which are local. This is because, unlike a steady-state visual evoked potential (SSVEP) or a P300 type of EEG response, which mostly affects a single brain lobe, driving is a higher-level cognition task and hence multiple systems of the brain are involved: vision, auditory, motor, etc. The mutual information I(X; Y) of discrete random variables X and Y is defined as
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x, y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal distributions. The desired feature of conditional entropy H(Y|X) is related to the mutual information I(X; Y) by
$$I(X; Y) = H(Y) - H(Y|X)$$
We calculated the conditional entropy using mutual information between all possible pairs of EEG electrodes for a given trial. Hence, for 14 EEG electrodes, we calculated 91 EEG features based on this measure. Fig. 1 shows the locations of the 14 EEG channels used in our study. The region in the frontal lobe is highlighted in red to show the type of interaction between electrodes that is being measured by the use of conditional entropy features.
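The pairwise feature construction can be illustrated with a minimal sketch. It assumes that each channel is discretized by histogram binning before the entropies are estimated; the number of bins (`n_bins`), the function names, and the synthetic trial are illustrative assumptions and are not specified in the text.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(x, y, n_bins=8):
    """H(Y|X) = H(Y) - I(X;Y) = H(X,Y) - H(X), after histogram binning.

    n_bins is an assumed discretization, not a value from the paper.
    """
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=n_bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins=n_bins)[1:-1])
    joint = xd * n_bins + yd  # unique label for every (x-bin, y-bin) pair
    return entropy(joint) - entropy(xd)

def pairwise_features(eeg, n_bins=8):
    """eeg: (channels, samples) -> one H(Y|X) value per unordered channel pair."""
    n = eeg.shape[0]
    return np.array([conditional_entropy(eeg[i], eeg[j], n_bins)
                     for i in range(n) for j in range(i + 1, n)])

rng = np.random.default_rng(0)
trial = rng.standard_normal((14, 256))  # synthetic 2 s trial at 128 Hz
features = pairwise_features(trial)
print(features.shape)  # (91,)
```

With 14 channels there are 14·13/2 = 91 unordered pairs, which matches the feature count stated above.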
2) Features based on Deep Learning: The most commonly used EEG features are based on the power spectral density (PSD) of different EEG bands. But these features in themselves do not take into account the EEG topography, i.e., the location of EEG electrodes (as shown in Fig. 1) for a particular EEG band. Hence, we try to exploit EEG topography to extract information regarding the interplay between different brain regions.
We calculated the PSD of three EEG bands, namely theta (4-7 Hz), alpha (7-13 Hz) and beta (13-30 Hz), for all the EEG channels. These three specific EEG bands were chosen since they are the most commonly used bands and are thought to carry a lot of information about human cognition. We averaged the PSD thus calculated for each band over the complete trial. These features from different EEG channels were then used to construct a 2-D EEG-PSD heatmap for each of the three EEG bands using bicubic interpolation. These heatmaps now contain the information related to EEG topography in addition to the spectrum density at each of these locations. Fig. 2 shows these 2-D heatmaps for each of the three EEG bands. As can be seen from the figure, we plot each of the three EEG bands using a single color channel, i.e., red, green and blue. We then add these three color band images to get a color RGB image containing information from the three EEG bands. The three color band images are added in proportion to the amount of EEG power in the three bands using alpha blending [31], with the weights of the three individual bands' images obtained by normalizing them using the highest value in the image. Hence, following this procedure, we are able to represent the information in the three EEG bands along with their topography using a single color image. The mixture of these three colors in various quantities (thus forming new colors by adding these primary colors) carries the information regarding how power in the three bands is distributed at various regions of the brain.
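The construction of the color EEG-PSD image can be sketched as follows. This is an illustration under several assumptions rather than the exact procedure: the electrode coordinates `XY` are placeholders (not the true Emotiv layout), SciPy's cubic scattered-data interpolation stands in for the bicubic interpolation, the band-to-channel order (theta, alpha, beta mapped to red, green, blue) is assumed, and the blending is reduced to a single normalization by the highest value in the image.

```python
import numpy as np
from scipy.signal import welch
from scipy.interpolate import griddata

FS = 128  # EEG sampling rate in Hz (from the text)
BANDS = {"theta": (4, 7), "alpha": (7, 13), "beta": (13, 30)}

# Placeholder 2-D electrode positions in the unit square (NOT the true layout).
XY = np.array([[0.35, 0.95], [0.65, 0.95], [0.15, 0.80], [0.85, 0.80],
               [0.30, 0.75], [0.70, 0.75], [0.20, 0.60], [0.80, 0.60],
               [0.05, 0.50], [0.95, 0.50], [0.20, 0.25], [0.80, 0.25],
               [0.35, 0.05], [0.65, 0.05]])

def band_powers(eeg, fs=FS):
    """eeg: (channels, samples) -> (channels, 3) mean PSD per band."""
    freqs, psd = welch(eeg, fs=fs, nperseg=min(eeg.shape[1], fs))
    cols = [psd[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
            for lo, hi in BANDS.values()]
    return np.stack(cols, axis=1)

def topo_rgb(powers, xy=XY, size=64):
    """Interpolate each band onto a grid and stack the bands as R, G, B."""
    gx, gy = np.meshgrid(np.linspace(0, 1, size), np.linspace(0, 1, size))
    img = np.zeros((size, size, 3))
    for k in range(powers.shape[1]):
        cubic = griddata(xy, powers[:, k], (gx, gy), method="cubic")
        near = griddata(xy, powers[:, k], (gx, gy), method="nearest")
        # Cubic interpolation is undefined outside the electrode hull.
        img[..., k] = np.where(np.isnan(cubic), near, cubic)
    img = np.clip(img, 0.0, None)  # cubic interpolation can overshoot below zero
    return img / img.max()         # one global scale keeps band proportions

rng = np.random.default_rng(0)
trial = rng.standard_normal((14, 2 * FS))  # synthetic 2 s trial
image = topo_rgb(band_powers(trial))
print(image.shape)  # (64, 64, 3)
```

Normalizing by one global maximum, rather than per channel, keeps the relative power of the three bands visible in the mixed colors; resizing to the network's input size is not shown.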
Since it is not possible to train a deep neural network from scratch without thousands of trials of EEG data (and no such dataset currently exists for the driving scenario), the combined color image representing EEG-PSD with topography information is fed to a pre-trained deep learning-based VGG-16 convolutional neural network [32] to extract features from this image. This network consists of 16 weight layers and has been trained with more than a million images for 1,000 object categories using the ImageNet database [33]. Previous research studies [22], [23] have shown that features from such an "off-the-shelf" neural network can be used for various classification problems with good accuracy. Even for research problems where the neural networks were trained on a different vision-based problem and applied to a totally different application, they still worked very well [23], [34], [35]. This is mostly because the low-level features such as texture, contrast, etc. reflected in the initial layers of the CNN are ubiquitous in any type of image. The EEG-PSD color image is resized to 224×224×3 for input to the network. The last layer of the network classifies the image into one of the 1,000 classes, but since we are only interested in "off-the-shelf" features, we extract 4,096 features from the last-but-one layer of the network. The EEG features from this method are then combined with those from the previous one for further analysis.

B. PPG-based feature extraction
PPG measures the changes in blood volume in the microvascular tissue bed. This is done in order to assess the blood flow as modulated by the heartbeat. Using a simple peak detection algorithm on the PPG signal, it is possible to calculate the peaks of the blood flow and measure the subject's heart rate in a much more wearable manner than a traditional electrocardiogram (ECG) system. The PPG signal was recorded using an armband (Biovotion) that measures PPG at a sampling rate of 51.2 Hz.
1) HRV and statistical time-domain features: Heart-rate variability (HRV) has been shown to be a good measure for classifying cognitive states such as emotional valence and stress [37]. HRV is much more robust than heart rate (HR), which changes slowly and generally only corresponds to physical exertion. A moving-average filter with a window length of 0.25 seconds was first applied to each trial to filter the noise in the PPG data. The PPG data so obtained was then scaled between 0 and 1, and a peak-detection algorithm [38] was applied to find the inter-beat intervals (RR) for the calculation of HRV. The minimum distance between successive peaks was taken to be 0.5 seconds to remove any false positives, as in Fig. 3. HR is defined as the total number of peaks per minute in the PPG. To obtain HRV from the PPG, we utilize the inter-beat interval (RR) between successive peaks. The pNN50 algorithm [39] was then used to calculate HRV from the RR intervals.
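The HR and HRV steps above can be sketched as follows. The sketch assumes `scipy.signal.find_peaks` as a stand-in for the peak-detection algorithm of [38] and takes pNN50 in its usual sense, the fraction of successive RR-interval differences that exceed 50 ms; the function names and the synthetic 72 beats-per-minute signal are illustrative only.

```python
import numpy as np
from scipy.signal import find_peaks

FS = 51.2  # PPG sampling rate in Hz (from the text)

def moving_average(x, fs=FS, window_s=0.25):
    """Moving-average filter with a 0.25 s window."""
    n = max(1, int(round(window_s * fs)))
    return np.convolve(x, np.ones(n) / n, mode="same")

def hr_and_pnn50(ppg, fs=FS):
    """Return heart rate (beats/min), pNN50 (fraction) and RR intervals (s)."""
    smooth = moving_average(ppg, fs)
    scaled = (smooth - smooth.min()) / (smooth.max() - smooth.min())
    # Peaks closer than 0.5 s are rejected as false positives.
    peaks, _ = find_peaks(scaled, distance=int(0.5 * fs))
    rr = np.diff(peaks) / fs
    hr = 60.0 * len(peaks) / (len(ppg) / fs)  # peaks per minute
    diffs = np.abs(np.diff(rr))
    pnn50 = float(np.mean(diffs > 0.050)) if len(diffs) else 0.0
    return hr, pnn50, rr

# Synthetic 30 s pulse wave at 1.2 Hz, i.e. about 72 beats per minute.
t = np.arange(int(30 * FS)) / FS
hr, pnn50, rr = hr_and_pnn50(np.sin(2 * np.pi * 1.2 * t))
print(round(hr), pnn50)
```

On real recordings the scaling step would fail for a constant signal, so a guard against a zero range would be needed before deployment.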
To explore the statistics related to the PPG wave itself in the time domain, we calculated six statistical features on the PPG wave as defined in [40]. These features map various trends in the signal by calculating the mean, standard deviation, etc. of the first and subsequent difference signals formed from the original signal.
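The text does not list the six features, so the following sketch is only one plausible reading. It assumes a set that is commonly used for physiological signals: the mean and standard deviation of the signal, and the mean absolute first and second differences of both the raw and the standardized signal. The exact definitions in [40] may differ.

```python
import numpy as np

def six_statistical_features(x):
    """Assumed feature set (not confirmed by the text):
    [mean, std, mean|d1|, mean|d1| of standardized signal,
     mean|d2|, mean|d2| of standardized signal],
    where d1 = x[n+1] - x[n] and d2 = x[n+2] - x[n].
    """
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd if sd > 0 else np.zeros_like(x)
    d1 = np.abs(x[1:] - x[:-1]).mean()
    d2 = np.abs(x[2:] - x[:-2]).mean()
    d1z = np.abs(z[1:] - z[:-1]).mean()
    d2z = np.abs(z[2:] - z[:-2]).mean()
    return np.array([mu, sd, d1, d1z, d2, d2z])

feats = six_statistical_features(np.arange(10.0))  # a simple ramp signal
print(feats)
```

The same function could be reused for any slowly varying one-dimensional signal, since it depends only on the sample values.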
2) Spectrogram Deep Learning features: Recent research studies have shown applications of PPG analyzed in the frequency domain for blood pressure estimation and gender identification, implying that it might be useful to analyze PPG in the frequency domain [41], [42]. The frequency range of PPG signals is low and hence we focus only on the 0-5 Hz range. Fig. 3 shows the generated frequency spectrogram [43] for this frequency range for the PPG signal in a trial. The different color values generated using the 'Parula' color-map show the intensity of the spectrogram at a specific frequency bin. Then, we resized the spectrogram images to feed them to the VGG-16 network (as we did above for the color EEG-PSD images), after which 4,096 features were extracted from the network. Time-domain statistical and HRV features from the method above were concatenated with these features for further analysis.
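Restricting the spectrogram to the low-frequency range can be sketched as below. The window length (`nperseg`), the 50% overlap and the decibel scaling are assumptions made for illustration; color-mapping the result and resizing it for the network are not shown.

```python
import numpy as np
from scipy.signal import spectrogram

def low_freq_spectrogram(x, fs=51.2, f_max=5.0, nperseg=256):
    """Spectrogram in dB, keeping only the frequency bins up to f_max Hz.

    nperseg and the overlap are assumed values, not taken from the paper.
    """
    nperseg = min(nperseg, len(x))
    f, t, sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    keep = f <= f_max
    return f[keep], t, 10.0 * np.log10(sxx[keep] + 1e-12)

fs = 51.2
time = np.arange(int(30 * fs)) / fs
f, t, s_db = low_freq_spectrogram(np.sin(2 * np.pi * 1.2 * time), fs=fs)
peak_hz = f[np.argmax(s_db.mean(axis=1))]  # dominant frequency of the test tone
print(s_db.shape, peak_hz)
```

The same function covers the narrower range used for slower signals by passing a smaller `f_max`, for example `f_max=2.0`.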

C. GSR-based feature extraction
The feature extraction pipeline for the GSR signal was similar to that for the PPG. The same two methods that were applied to the PPG were utilized for GSR too. As with PPG, the signals are sampled at 51.2 Hz by the device.
1) Statistical features: The GSR data was first low-pass filtered with a moving-average window of 0.25 seconds to remove any bursts in the data. Eight features based on the profile of the signal were then calculated. The first two of these features were the number of peaks and the mean of the absolute heights of the peaks in the signal. Such peaks and their time differences may prove to be a good measure of arousal. The remaining six features were calculated as in [40], as for the PPG signal above.
2) Applying Deep Learning to the spectrogram: Since GSR signals change very slowly, we focus only on the 0-2 Hz frequency range. We then generate the spectrogram image for GSR in the above frequency range for each trial. We then send the spectrogram image to the VGG-16 deep neural network and extract the most significant 4,096 features from the same. These features are then concatenated with the features from the time-domain analysis.

D. Facial expression-based feature extraction
As discussed above, the analysis of facial expressions has been the preferred modality for driver attention analysis. Hence, our goal has been to use this method to compare it against the bio-sensing modalities. Furthermore, most of the research work in this area has been done by tracking fixed localized points on the face based on face action units (AUs). Hence, below we show a novel deep learning-based method to extract relevant features from the faces for driver attention and hazardous conditions detection.
First, we extract the face region from the frontal body image of the person captured by the camera for each frame. This is done by fixing a threshold on the image size to reduce its extreme ends and setting the minimum face size to 50×50 pixels. This was done to remove any false positives and decrease the computational space for face detection. We then used the Viola-Jones object detector with Haar-like features [44] to detect the most likely face candidate.
1) Facial points localization based features: Face action units (AUs) have been used for a variety of applications ranging from affective computing to face recognition [46]. The Facial Action Coding System (FACS) is the most commonly used method to code the facial expressions and map them to different emotional states [47]. Our goal was to use face localized points similar to the ones used in FACS without identifying the facial expression such as anger, happiness, etc., since they are not highly relevant in the driving domain and over short time intervals. The use of FACS needs first to identify multiple facial landmarks, which are then tracked to map the changes in the facial expressions. We applied the state-of-the-art Chehra algorithm [45] to the extracted face candidate region from above. This algorithm outputs the coordinates of 49 localized points (landmarks) representing various features of the face as in Fig. 4. This algorithm was chosen because of its ability to detect these landmarks through its pre-trained models, hence not needing training for any new set of images.
Fig. 4. Detected face (marked in red) and face localized points (marked in green) for two participants (left and center) in the study, and some of the features (marked in yellow) computed using the coordinates of the face localized points. These features were then normalized using the size of the face in the camera, i.e., number of pixels in height (H) and width (W).
These face localized points are then used to calculate 30 different features based on distances, such as between the center of the eyebrow and the midpoint of the eye, between the midpoint of the nose and the corners of the lower lip, between the midpoints of the two eyebrows, etc., and on the angles between such line segments. To remove variations by factors such as distance from the camera and face tilt, we normalized these features using the dimensions of the face region. All these features were calculated for individual frames, many of which make a trial. Hence, to map the variation in these features across a trial (which may directly correspond to the driver's attention and driving condition), we calculate the mean, 95th percentile (more robust than the maximum), and standard deviation of these 30 features across the frames in the trial. In this manner, we get 90 features based on face-localized points from a particular trial.
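The trial-level summary of the per-frame features can be sketched in a few lines. The function name and the random frame features are illustrative; only the three statistics and the resulting feature count come from the text.

```python
import numpy as np

def summarize_trial(frame_features):
    """(n_frames, n_features) -> concatenated mean, 95th percentile and std."""
    f = np.asarray(frame_features, dtype=float)
    return np.concatenate([f.mean(axis=0),
                           np.percentile(f, 95, axis=0),
                           f.std(axis=0)])

rng = np.random.default_rng(0)
frames = rng.random((20, 30))          # synthetic trial: 20 frames, 30 features
trial_vector = summarize_trial(frames)
print(trial_vector.shape)  # (90,)
```

Because the summary does not depend on the number of frames, trials of different lengths map to vectors of the same size.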
2) Deep Learning-based features: For the extraction of deep learning-based features, we use the VGG-Faces network instead of VGG-16. This is done to extract features more relevant to faces since the VGG-Faces network has been trained on more than 2.6 million face images from more than 2,600 people rather than on various object categories as in the VGG-16 network. We send each face region to the network and extract the most significant 4,096 features. To represent the changes in these features across the trial, i.e., across the frames, we calculate the mean, 95th percentile, and standard deviation of the features across the frames in a trial. We then analyze the features from this method separately from those of the traditionally used face-localized points-based method above to compare the two.

E. Assessing trends of EEG features using Deep Learning
The EEG features discussed in Section III.A above were computed over the whole trial, such as by generating a single EEG-PSD image for a particular trial. This is a special case in which the data from the whole trial is averaged. Here, we propose a novel method based on deep learning to compute the trend of EEG features, i.e., their variation within a trial. To compute features with more resolution, we generate multiple EEG-PSD images for successive time durations in a trial. Fig. 5 shows the network architecture for this method. The EEG-PSD images are generated for multiple successive time durations in a trial, each of which is then sent to the VGG-16 network to obtain the 4,096 most significant features. Similarly, this process is done for the conditional entropy features by calculating them over multiple time periods in a trial rather than once on the whole trial. We then use principal component analysis (PCA) [36] to reduce the feature size to 60 to save computational time in the next step. These 60×N (N = number of successive time intervals) features are then sent as input to a Long Short-Term Memory (LSTM) network [49]. The LSTM treats each of these features as a time series and is trained so as to capture the trend in each of them for further analysis. This method can only be applied when the time duration of the trials is fixed since the length of each time series should be the same. Hence, we apply this method only in the trials used for detecting hazardous/non-hazardous situations. However, since the analysis of such situations is done over short time intervals, we cannot use this method for the PPG and GSR modalities since they take a few seconds to react physiologically to the situation.
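The data layout that feeds the LSTM can be sketched as follows. The sketch covers only the PCA reduction and the reshaping into one 60-dimensional vector per time window; the LSTM itself is omitted. The SVD-based PCA, the function names and the synthetic array sizes are assumptions for illustration, and in a cross-validated setting the projection would be fitted on the training subjects only.

```python
import numpy as np

def fit_pca(x, n_components):
    """Return the mean and the top principal directions of x (samples, dims)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def to_sequences(window_features, n_components=60):
    """(n_trials, n_windows, n_dims) -> (n_trials, n_windows, n_components)."""
    n_trials, n_windows, n_dims = window_features.shape
    flat = window_features.reshape(-1, n_dims)      # one row per time window
    mean, components = fit_pca(flat, n_components)
    reduced = (flat - mean) @ components.T
    return reduced.reshape(n_trials, n_windows, n_components)

rng = np.random.default_rng(0)
# Synthetic stand-in: 35 trials, N = 4 windows, 256-dim features per window
# (the text uses 4,096-dim network features; 256 keeps the example light).
sequences = to_sequences(rng.standard_normal((35, 4, 256)))
print(sequences.shape)  # (35, 4, 60)
```

Each trial then forms a sequence of N vectors of length 60, which is why all trials must contain the same number of windows.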

IV. DATASET DESCRIPTION
In Fig. 6 we show the experimental setup for data collection with driving videos used as the stimulus in our dataset. Twelve participants (most of them in their 20s, with two older than 30 years) based in San Diego participated in our study. The participants were comfortably seated and equipped with an EEG headset (Emotiv EPOC) containing 14 EEG channels (sampling rate of 128 Hz) and an armband (Biovotion) for collecting PPG and GSR (sampling rate of 51.2 Hz). This EEG headset was chosen since it is easily wearable and does not require the application of electrode gel, making it ideal for use in the driving context, but this comes at the cost of forgoing a very high sampling rate and a greater number of EEG channels. The positioning of the GSR sensor is, however, sub-optimal since we do not place the sensor at the palm or the feet. This choice was driven by the practicality of data collection in the driving scenario since users interact with multiple vehicle modules with their palms and feet during driving. The facial expressions of the subject were recorded using a camera in front of him/her. The participants were asked to use a driving simulator, which they were instructed to control as per the situation in the driving stimulus. For example, if there was a "red light" or "stop sign" at any point in a driving stimulus video, the participants were to press and hold the brake.
For consistency between our work and previous studies [14], [25], we use 15 video sequences from the KITTI dataset [24]. These video sequences range from 14 to 105 seconds. The videos in the dataset were recorded at 1242×375 resolution at 10 frames per second (fps). We resized the videos to 1920×580 to fit the display screen in a more naturalistic manner. These video sequences were chosen based on external annotation by two subjects who judged them based on the potential hazardous events in them. But video sequences from the KITTI dataset suffer from three limitations, namely low resolution, low fps, and few sequences of driving on highways. Additionally, since the videos in the KITTI dataset were captured at 10 fps, they may produce a steady-state visual-evoked potential (SSVEP) effect in EEG [26]. This is undesirable since we only focus on driver attention and hazardous/non-hazardous event analysis, and SSVEP might act as noise in the process.
Hence, we collected our own dataset of 20 video sequences containing real-world driving data on freeways and in downtown San Diego, California. This dataset was collected using our LISA-T vehicle testbed, in which a Tesla Model 3 is equipped with six external-facing GoPro cameras. It is also to be noted that while capturing these videos the vehicle was in the autonomous driving mode, making the LISA dataset the first of its kind. The cameras were operating at a 122-degree field of view, which is very representative of human vision. Furthermore, we presented these video sequences on a large screen (45.9 inches diagonally) at a distance of a meter from the participants to model a real-world driving scenario. These video sequences ranged from 30 to 50 seconds in length and were shown to the participants at 1920×1200 resolution at 30 fps. External annotation was done to classify parts of the video sequences from both datasets into hazardous/non-hazardous events. For example, an event where a pedestrian suddenly appears to cross the road illegally was termed hazardous, whereas an event where a stop sign can be seen from a distance and the vehicle's speed is decreasing was termed non-hazardous. External annotation was also performed to classify every video sequence according to how attentive the driver ought to be in that particular sequence.

V. QUANTITATIVE ANALYSIS OF MULTI-MODAL BIO-SENSING AND VISION SENSOR MODALITIES
In this section, we present the various single-modality and multi-modal evaluation results for driver attention analysis and hazardous/non-hazardous instance classification. First, the videos in both datasets were externally annotated by two annotators for low/high driver attention required. For example, the video instances where the car is not moving at all were characterized as low-attention instances, whereas driving through narrow streets with pedestrians on the road was labeled as an instance with high driver attention required. Hence, among the 35 videos, 20 were characterized as requiring low attention and 15 as high-attention ones.
Second, 70 instances, each two seconds long, were found in the videos and were characterized as hazardous/non-hazardous. Fig. 8 presents some examples of instances from both categories. As an example, a pedestrian suddenly crossing the road "unlawfully" or a vehicle overtaking suddenly represents hazardous events, whereas a "red" traffic sign at a distance and a pedestrian at a crossing with the ego vehicle not in motion are examples of non-hazardous events. Among the 70 instances, 30 instances were labeled as hazardous whereas the rest were labeled as non-hazardous. Hence, the goal is to classify such instances in a short time period of two seconds using the above modalities. Since PPG and GSR have low temporal resolution and do not reflect changes in such short time intervals, we used only facial features and EEG for hazardous/non-hazardous event classification.
For each modality, we first used PCA [36] to reduce the number of features from the above algorithms to 30. We then used extreme learning machines (ELM) [50] for classification. The choice of ELM over other feature classification methods was driven by previous studies showing that ELM performs better for features derived from bio-signals [51], [52]. These features were normalized between -1 and 1 across the subjects before training. A single-hidden-layer ELM was used with a triangular basis function for activation. For the method with trend-based temporal EEG and face feature data, we used a two-layer LSTM with 200 and 100 neurons in the respective layers instead of ELM for classification. The LSTM network was trained using a stochastic gradient descent with momentum (SGDM) optimizer. We performed leave-one-subject-out cross-validation for each case. This meant that the data from 11 subjects (385 trials) were used for training at a time and the classification was done on the 35 trials from the remaining 12th subject. This choice of cross-validation was driven by two factors. First, this method of cross-validation is much more robust and less prone to bias than schemes such as leave-one-sample-out cross-validation, in which the training data includes trials from all the subjects at any given time. Second, since the data contained only 420 trials, as opposed to thousands of trials for any decent image-based deep-learning dataset, it does not make sense to randomly divide such a small number of trials into training, validation, and test sets, since doing so might introduce bias through uneven division of trials from individual subjects.
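The ELM branch of this procedure (PCA to 30 features, scaling to [-1, 1], a single hidden layer with triangular basis activation, and leave-one-subject-out evaluation) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in this work: the data are synthetic placeholders, all function names are hypothetical, the hidden-layer size of 170 is borrowed from the attention analysis below, and fitting PCA and the scaling on the training subjects only is a choice made here to avoid leakage rather than a detail confirmed by the text.

```python
import numpy as np


def tribas(x):
    """Triangular basis activation: 1 - |x| on [-1, 1], 0 elsewhere."""
    return np.maximum(0.0, 1.0 - np.abs(x))


def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def scale_fit(X):
    """Per-feature minimum and range used to map features onto [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return lo, span


def elm_fit(X, y, n_hidden, rng):
    """Random fixed hidden layer; output weights solved by least squares."""
    W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, size=n_hidden)
    H = tribas(X @ W + b)
    T = np.eye(2)[y]                 # one-hot targets for the two classes
    beta = np.linalg.pinv(H) @ T     # Moore-Penrose solution
    return W, b, beta


def elm_predict(model, X):
    W, b, beta = model
    return np.argmax(tribas(X @ W + b) @ beta, axis=1)


def loso_accuracy(X, y, subjects, n_components=30, n_hidden=170, seed=0):
    """Leave-one-subject-out accuracy, one value per held-out subject."""
    rng = np.random.default_rng(seed)
    accs = []
    for s in np.unique(subjects):
        tr, te = subjects != s, subjects == s
        mean, axes = pca_fit(X[tr], n_components)
        Ztr, Zte = (X[tr] - mean) @ axes, (X[te] - mean) @ axes
        lo, span = scale_fit(Ztr)
        Ztr = 2 * (Ztr - lo) / span - 1
        Zte = 2 * (Zte - lo) / span - 1
        model = elm_fit(Ztr, y[tr], n_hidden, rng)
        accs.append(np.mean(elm_predict(model, Zte) == y[te]))
    return np.array(accs)


# Synthetic stand-in data: 12 subjects x 35 trials, 64 raw features (assumed).
rng = np.random.default_rng(42)
n_subjects, n_trials, n_raw = 12, 35, 64
subjects = np.repeat(np.arange(n_subjects), n_trials)
y = np.tile(np.r_[np.zeros(20, int), np.ones(15, int)], n_subjects)
X = rng.normal(size=(n_subjects * n_trials, n_raw)) + 0.8 * y[:, None]

accs = loso_accuracy(X, y, subjects)
print(f"folds={accs.size}, accuracy={accs.mean():.3f} +/- {accs.std():.3f}")
```

Each fold trains on 385 trials and tests on the 35 trials of the held-out subject, and the per-fold accuracies can be summarized as a mean and standard deviation across subjects, which is the form in which results are reported in the following subsections.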
Both feature classification methods, i.e., LSTM-based and ELM-based, were used independently for classifying features against the labels. When a higher temporal resolution was taken into consideration, i.e., trends in a series of EEG-PSD images, the LSTM-based method was used for feature classification. This is because the features then vary as a time series for each trial, and ELM cannot be used for such time-series-based classification. The ELM-based method was used in the other case, when the data from the complete trial is represented by a single (non-varying in time) value for each feature.

A. Evaluating attention analysis performance
In this section, we evaluate single- and multi-modality performance for assessing the driver's attention across the video trials. For all four modalities, the features as defined above were calculated for data from each video trial. The ELM-based classifier was then trained with each video trial assigned to one of two classes, representing low attention and high attention required by the driver.
1) Singular modality analysis: To compare the performance among the different modalities, the number of neurons in the hidden layer was set to 170 for each modality. Fig. 9 shows these results. Clearly, EEG performs the best among the four modalities for driver attention classification. The average classification accuracies for EEG, PPG, GSR, and face videos are 95.71 ± 3.95%, 81.54 ± 6.67%, 56.02 ± 3.04%, and 80.11 ± 3.39%, respectively. Hence, GSR performs only at about chance level (which may have to do with its suboptimal positioning on the body in the driving context, as pointed out by a reviewer), whereas on average PPG and face videos perform about equally well. Hence, we see that the sensor modalities with good temporal resolution, i.e., EEG and vision, perform better than the ones with low temporal resolution (PPG and GSR), thus evaluating our first hypothesis. We can see that for all the subjects except one, EEG's classification accuracy is above 90%, while for three modalities (EEG, GSR, and vision) the standard deviation in performance across the subjects is not too high.
2) Multi-modality analysis: Since EEG performs best among the four modalities by far, we do not expect much further increase in classification accuracy from combining it with other modalities that perform much worse. Fig. 10 shows that on combining EEG with PPG and GSR there is no increase in performance across the subjects (although this might not be the case for a few subjects). On the contrary, when the features from the low-performing (and poor temporal resolution) modalities, i.e., PPG and GSR, are combined with EEG, the performance is not as good as EEG alone for most of the subjects. The mean accuracies across all the subjects were 92.58 ± 3.96%, 80.11 ± 3.39%, and 80.01 ± 6.78% for the three cases respectively, all of which were significantly above chance accuracy. Hence, we see that it is not always beneficial to use features from multiple sensor modalities. For most of the subjects and modalities, the fusion of features does not perform better and hence may not be advantageous in this case. We think that this is because of the vast difference in the performance of each modality when used independently, based on the subject's physiology. This leads to an increase in performance for some of the subjects but not for all.

B. Evaluating hazardous/non-hazardous incidents classification
In this section, we present the results of evaluating the modalities over very short time intervals (2 seconds) pertaining to hazardous/non-hazardous driving incidents, as shown in Fig. 8. Since GSR and PPG do not provide such a fine temporal resolution, we do not use these modalities for this evaluation. This is because GSR changes very slowly, i.e., it takes more than a few seconds to vary, and a PPG segment as short as 2 seconds would contain only 2-4 heartbeats, which are not enough for computing heart rate or heart-rate variability. Previous studies assessing human emotions using GSR and PPG have used time windows on the order of multiple seconds (significantly longer than the two-second hazardous incident evaluation in the driving context) [53]. Also, it is not possible for the subjects to tag the incidents while they are participating in the driving simulator experiment, and hence these incidents were marked by the external annotators as mentioned above in Section V.
1) Single-modality analysis: Fig. 11 shows the results for classifying hazardous/non-hazardous incidents using EEG and face-expression features. As we can see from the figure, the accuracy for both modalities for all the subjects is well above chance level (50%). It is interesting to note that, depending on the subject, one of the two modalities outperforms the other, though features from EEG outperform face-based features for half of the participants. This supports our earlier point: a bio-sensing modality (here EEG) may outperform the vision modality depending on the user's physiology. This variation in results is natural since some people tend to be more expressive with their facial expressions, while on the other hand the "perceived hazardousness" of a situation varies across subjects. The mean accuracies among subjects were 91.43 ± 5.17% and 88.10 ± 3.82% for EEG- and face-based features, respectively. Since the evaluation was done on 2-second time intervals, i.e., without a lot of data, we note that such a high mean accuracy for both modalities was only possible due to using deep-learning-based features in addition to the traditional features for both modalities. This is further substantiated by the fact that we used an EEG system with far fewer channels than previous studies using EEG [14]. We show that, using such deep-learning features, our method outperforms the previous results for EEG on the KITTI dataset [14], [25] in a similar experimental setup with hazardous/non-hazardous event classification.
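The argument for excluding PPG at the 2-second scale can be checked with simple arithmetic. The sketch below is illustrative: the function name is hypothetical and the steady heart rates of 60 to 120 beats per minute are assumed values chosen to match the stated range of 2-4 heartbeats.

```python
def beats_in_window(heart_rate_bpm, window_s):
    """Expected number of heartbeats in a window at a steady heart rate."""
    return heart_rate_bpm * window_s / 60.0


for bpm in (60, 90, 120):  # assumed plausible heart rates
    beats = beats_in_window(bpm, 2.0)
    intervals = max(beats - 1, 0)  # inter-beat intervals available for HRV
    print(f"{bpm} bpm: {beats:.0f} beats, {intervals:.0f} inter-beat intervals")
```

With at most three inter-beat intervals in such a window, any variability statistic would rest on very few samples, which is consistent with restricting the 2-second analysis to EEG and vision.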
2) Multi-modality analysis: In this section, we present the results for classifying hazardous/non-hazardous incidents by combining features from the EEG and face modalities. We do this in two ways. First, we directly combine the features from the single-modality analysis by concatenating them. Second, as mentioned in Section III.E, we use an LSTM classifier over the features from both modalities calculated for every frame in the 2-second-long sequence. The trend of these features is then fed to the LSTM for training. Fig. 12 shows the results of the two approaches. As is clear from the figure, combining the features from the two modalities may or may not increase the performance further compared to using the individual modalities shown earlier in Fig. 11. But we note that on taking the trend of features, i.e., increased temporal resolution, into account, the performance of combining the modalities increases further for most of the subjects. The average accuracies across subjects, 92.38 ± 4.10% and 94.76 ± 3.41% respectively, are also higher than for the singular modality analysis. Hence, using multiple modalities with high temporal resolution (EEG and vision) may prove to be best when computing features over short time durations together with their trend.
Table I reports the average accuracy across subjects using EEG and face-based features for driver attention analysis over the whole video and for 2-second hazardous/non-hazardous incident classification. EEG features generally outperform face features in both cases. Using LSTM, i.e., better temporal resolution, also increases the accuracy. LSTM could not be used in attention analysis since the duration of the videos varies widely among the datasets.
Since the EEG and face modalities can be used over short time intervals, Table I shows the mean accuracy across subjects for using EEG and faces (i.e., the vision modality) alone and in combination for the two types of analysis done above. We can see that the performance of EEG combined with faces can be better than when either modality is used independently for hazardous incident analysis when using features from the LSTM, i.e., the trend over the changes in features. However, adding multiple modalities together without trend-based LSTM analysis may not prove very beneficial. This addresses our second hypothesis by showing that it is beneficial to fuse the modalities if both have good temporal resolution, so that short-duration features can be extracted from them to map the trend.
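The two fusion layouts discussed in this subsection differ mainly in the shape of the data handed to the classifier. The sketch below illustrates this with placeholder arrays; the helper name is hypothetical, the feature count of 30 per modality and the 60 time steps (2 s at 30 fps) are assumptions made for illustration, and whether dimensionality reduction is applied before or after concatenation is not specified in the text.

```python
import numpy as np


def fuse_by_concatenation(*feature_blocks):
    """Stack per-trial feature matrices side by side (trials must align)."""
    n_trials = {block.shape[0] for block in feature_blocks}
    if len(n_trials) != 1:
        raise ValueError("all modalities need the same number of trials")
    return np.concatenate(feature_blocks, axis=1)


rng = np.random.default_rng(0)

# Layout 1: one feature vector per incident and modality (placeholder values).
eeg = rng.normal(size=(70, 30))    # 70 incidents x 30 EEG features
face = rng.normal(size=(70, 30))   # 70 incidents x 30 face features
fused = fuse_by_concatenation(eeg, face)
print(fused.shape)       # (70, 60)

# Layout 2: one feature vector per time step, as a sequence model would consume.
n_steps = 60  # assumption: 2 s at 30 fps
eeg_seq = rng.normal(size=(70, n_steps, 30))
face_seq = rng.normal(size=(70, n_steps, 30))
fused_seq = np.concatenate([eeg_seq, face_seq], axis=2)
print(fused_seq.shape)   # (70, 60, 60)
```

The first layout suits a classifier that expects a fixed-length vector per trial, while the second preserves the ordering in time that a trend-based classifier relies on.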

VI. CONCLUDING REMARKS
The use of multiple bio-sensing modalities combined with audio-visual ones is rapidly expanding. With the advent of compact bio-sensing systems capable of data collection during real-world tasks such as driving, it is natural that this research area will gather more interest in the coming years. In this work, we evaluated multiple bio-sensing modalities together with the vision modality for driver attention and hazardous event analysis. We also presented a pipeline to process data from individual modalities that uses pre-trained convolutional neural networks to extract deep-learning-based features from these modalities in addition to traditionally used ones. In this process, we were able to compare the performance of the modalities against each other while also combining them. As the next step to this preliminary study, we would like to collect data in more complex and safety-critical situations from "real-world" drives. The current system for data collection, noise removal, feature extraction, and classification works in real time, but the data tagging for hazardous/non-hazardous events is still manual. This restricts the relevance of our current system to gaming, attention monitoring, etc. in driving simulators. Hence, we would like to build a computer-vision-based model that can automatically predict hazardous/non-hazardous events during a real-world drive instead of relying on manual tagging, so that our system can be deployed in "real-world" drives. We would also like to devise a real-time feedback system based on the driver's attention so as to verify our pipeline's performance in real-world driving scenarios.

Fig. 1. Locations of EEG channels. All possible pairs of electrodes used to calculate conditional entropy for mapping interplay between different brain regions.

Fig. 2. PSD heat-maps of the three EEG bands, i.e., theta (red), alpha (green), and beta (blue), are added according to the respective color-bar range to get a combined RGB heat-map image. (Circular outline, nose, ears, and color-bars have been added for visualization only. All units are in Watt per Hz.)

Fig. 3. For a trial, PPG signal with peaks (in red) being detected for the calculation of RRs and HRV (above), and PPG spectrogram (below).

Fig. 6. Experiment setup for multi-modal data collection. (A) EEG Headset, (B) PPG and GSR armband, (C) External camera, and (D) Driving videos displayed on the screen. The subject sits with her/his arms and feet on a driving simulator with which s/he interacts while watching the driving videos.

Fig. 7. Various image instances with varying illumination conditions and type of road (street, single-lane, highway, etc.) from (A) LISA Dataset and (B) KITTI Dataset.

Fig. 8. (A) Examples of 2-second incidents classified as hazardous. Examples include pedestrians crossing the street without a crosswalk while the ego vehicle is being driven and another vehicle overtaking suddenly. (B) Examples of 2-second incidents classified as non-hazardous. Examples include stop signs and railway crossing signs. For each category, the top images are from the KITTI dataset whereas the bottom images are from the LISA dataset.

TABLE I
SINGLE VS. MULTI-MODALITY PERFORMANCE EVALUATION