Driver Monitoring of Automated Vehicles by Classification of Driver Drowsiness Using a Deep Convolutional Neural Network Trained by Scalograms of ECG Signals

Abstract: Driver drowsiness is one of the leading causes of traffic accidents. This paper proposes a new method for classifying driver drowsiness using deep convolutional neural networks trained by wavelet scalogram images of electrocardiogram (ECG) signals. Three different classes were defined for drowsiness based on video observation of driving tests performed in a simulator for manual and automated modes. The Bayesian optimization method is employed to optimize the hyperparameters of the designed neural networks, such as the learning rate and the number of neurons in every layer. To assess the results of the deep network method, heart rate variability (HRV) data is derived from the ECG signals, some features are extracted from this data, and finally, random forest and k-nearest neighbors (KNN) classifiers are used as two traditional methods to classify the drowsiness levels. Results show that the trained deep network achieves balanced accuracies of about 77% and 79% in the manual and automated modes, respectively. However, the best balanced accuracies obtained using traditional methods are about 62% and 64%. We conclude that the designed deep networks working with wavelet scalogram images of ECG signals significantly outperform KNN and random forest classifiers that are trained on HRV-based features.


Introduction
Drowsiness is defined as a transitional state fluctuating between alertness and sleep that increases the reaction time to critical situations and leads to impaired driving [1,2]. According to previous studies, driver drowsiness is one of the leading causes of traffic accidents. For example, the National Highway Traffic Safety Administration (NHTSA) reported that drowsy drivers were involved in about 800 fatal crashes in 2017 [3].
Another study reported that drowsy drivers contribute to about 22-24% of crashes or near-crashes [4]. The American Automobile Association (AAA) has also reported that about 24% of drivers acknowledged feeling extremely sleepy while driving at least once in the previous month [5].
Moreover, monitoring of driver alertness is an implicit requirement of the forthcoming SAE level of conditional automated driving (level 3), since handing over vehicle control to drowsy drivers is unsafe [6,7]. Various driver drowsiness detection systems (DDDS) have already been proposed in recent studies [8][9][10][11]. In our previous work [2], we developed a method for drowsiness classification only in manual driving mode and using only vehicle-based data. Since drivers provide no input to the vehicle during automated driving tests, the method proposed in [2] cannot be used in automated driving. Moreover, vehicle-based data can be significantly affected by road geometry and the driving behavior of the specific driver. In contrast, the method proposed in this paper uses ECG data as input to deep CNNs and can be applied in both manual and automated driving modes. Furthermore, biosignals such as the ECG can detect the onset of drowsiness more accurately than vehicle-based data [12,13]. This paper offers a new method using deep neural networks trained by wavelet scalograms of an electrocardiogram (ECG) signal.

Related Works
ECG signals present the heart's electrical activity over time and are typically recorded using electrodes attached to the chest [14]. Figure 1 shows the schematic representation of a standard ECG signal [15]. Heart rate variability (HRV) information is extracted by detecting the R-peaks in the ECG signal and evaluating the fluctuations of the time intervals between adjacent R-peaks [16]. HRV is a well-known physiological signal that reflects the activity of the autonomic nervous system (ANS) [17] and fluctuates markedly over the day and the sleep-wake cycle [18]. Therefore, it is assumed to be indicative not only of the sleep stages [19] but also of sleepiness. HRV has been frequently employed to design a DDDS. For example, Fujiwara et al. [20] developed a system based on eight features extracted from HRV data, where multivariate statistical process control was used as an anomaly detection method. Results showed that the proposed method detected 12 out of 13 drowsiness onsets, with a false-positive rate of about 1.7 per hour. Huang et al. used machine learning with four traditional classifiers (support vector machine, K-nearest neighbor, naïve Bayes, and logistic regression) for binary detection of drowsiness by training on time- and frequency-domain features from HRV data [17]. Results showed that the K-nearest neighbor classifier achieved the best accuracy, about 75.5%. To discriminate between the HRV dynamics in two states of fatigue (caused by sleep deprivation) and drowsiness (caused by monotonous driving), two different monitoring systems were proposed in [21] based on features from HRV and respiration signals. One of these systems is a binary classifier (alert/drowsy) for assessing the level of driver vigilance every minute. The other system detects the level of the driver's sleep deprivation in the first three minutes of driving.
That study showed that the balanced accuracy of the drowsiness detection system using only HRV-based features was about 65.5%. However, by adding features from respiration signals, this system achieved a balanced accuracy of 78.5%, an improvement of about 13 percentage points. The balanced accuracy of the sleep deprivation system was about 75%, and it detected 8 out of 13 sleep-deprived drivers correctly. Another study, conducted by Buendia et al. [22], investigated the relationship between drowsiness levels rated with the Karolinska sleepiness scale (KSS) and heart rate dynamics. Results showed that the average heart rate decreased with increasing KSS (i.e., higher drowsiness levels), whereas heart rate variance increased in drowsy states. Patel et al. [23] developed a neural network classifier to detect the early onset of driver drowsiness by analyzing the power of the low- and high-frequency HRV sub-bands. The spectral image, plotted from the power spectral density of the HRV data, was the input to the neural network, which yielded an accuracy of 90%. In [24], Li and Chung used a wavelet transformation to extract features from HRV signals and compared them with fast Fourier transform (FFT)-based features. Receiver operating characteristic curves were used for feature selection and support vector machines as classifiers. The wavelet method outperformed the FFT-based system, achieving an overall accuracy of 95%. Furman et al. [25] reported that HRV activity in the very-low-frequency range (0.008-0.04 Hz) significantly and consistently decreases approximately five minutes before extreme signs of drowsiness can be observed.

Contribution of the Method
Previous studies commonly used hand-crafted techniques or dimensionality reduction methods to extract features from HRV data for driver drowsiness classification. Most commonly, heart rate variability data are derived by detecting the R-peaks in the ECG signal and processing the information of the R-peak time points only. However, other segments of the ECG signal (see Figure 1) might also be associated with different levels of drowsiness. Furthermore, previous studies widely used traditional machine learning classifiers to classify driver drowsiness; however, deep neural networks are expected to outperform them if a large data set is available for training. In this study, we first employed the wavelet transformation to generate 2D scalogram images of the ECG signal, which capture time-frequency domain features. These images serve as input data to a deep convolutional neural network, and Bayesian optimization is applied to optimize the hyperparameters of this network. To compare the results of this approach with previous methods, HRV data is also derived from the ECG signals in the common way, and its extracted features are utilized to classify driver drowsiness using two traditional classifiers: K-nearest neighbors (KNN) and random forest.
The rest of this paper is structured as follows: Section 2 explains the experimental setup and the testing procedure that was used to collect the dataset. Section 3 describes the methodology for the classification of driver drowsiness. Section 4 presents the results of the proposed method, discusses the results, and compares them with the outcomes of other algorithms. Finally, Section 5 presents our conclusions and suggests future tasks to improve the proposed method.

Experimental Setup and Testing Procedure
This study utilizes the dataset collected during the WACHSens project, a joint project of the Human Research Institute Weiz, the Graz University of Technology, apptec Factum Vienna, and AVL U.K. The tests were performed in the automated driving simulator of Graz (ADSG) at the Institute of Automotive Engineering, Graz University of Technology. The driving simulator is presented in Figure 2. The following subsections explain the structure of the ADSG, simulated driving test procedure, and definition of ground truth for driver drowsiness. To cancel the external noise and adjust the indoor temperature, ADSG is separated from its surrounding area using an insulating housing cube.

Driving Simulator
In the ADSG, the visual cues are simulated by eight LCD panels covering a 180-degree field of view, plus a rear screen that is observed via the interior mirror. The side mirrors are implemented in the LCDs covering the side windows. The acoustic cue is simulated by generating engine and wind noise played through the car's sound system. Moreover, four bass shakers generate vibration in the car chassis and in the driver and passenger seats. Haptic feedback is provided by the Sensodrive™ simulator steering wheel (Weßling, Germany) [26] and an active brake pedal simulator; gas pedal and gear-shift inputs are taken from the vehicle's unmodified controls. The vehicle dynamics states are calculated by the full-vehicle simulation software AVL-VSM™ (Graz, Austria) [27], parametrized with a middle-class passenger car. The vehicle model calculates the dynamics states as well as the engine speed and torque for the acoustic simulation. Adaptive cruise control (ACC) and lane-keeping assist (LKA) systems are also implemented in this simulator to control the vehicle's longitudinal and lateral dynamics during automated driving tests. Different features of this simulator were studied in our previous works [28,29].

Participants and Driving Tests Procedure
In this project, different types of physiological data were collected from 92 drivers. These drivers participated in manual and automated driving tests in two different vigilance states: fatigued and rested. This procedure results in four different driving sessions for each participant: fatigued automated driving, fatigued manual driving, rested automated driving, and rested manual driving. In the rested condition, drivers were required to have a full night's sleep before performing the tests. For the fatigued condition, the drivers could choose one of the two following options: (1) extended wakefulness (being awake for at least 16 h continuously before starting the tests in the conditions fatigued automated and fatigued manual) and performing the tests at their usual bedtime, or (2) being sleep-restricted by sleeping a maximum of four hours in the night before the tests. The age and gender of the participants were balanced across the sample, as presented in Table 1. The Female-60+ group has only 12 participants, since we could not recruit more still-active drivers from this group in the available time frame.
Several biosignals, namely, ECG, EEG, EOG, skin conductivity, and oronasal respiration, were collected using a g.Nautilus™ device (Schiedlberg, Austria; research version) with a sampling frequency of 500 Hz. Facial-based data such as eyelid opening, pupil diameter, and gaze direction were also measured with a sampling frequency of 100 Hz using a SmartEye™ (Gothenburg, Sweden) eye-tracker system installed on the car dashboard. In this study, only ECG signals are employed to classify the driver's drowsiness. The study was conducted according to the ethical guidelines of the Declaration of Helsinki and the General Data Protection Regulation of the European Union. The study protocol was approved by the Ethics Committee of the Medical University of Graz in vote 30-409 ex 17/18 dated 1 June 2018. Written informed consent was obtained from participants before the experiments, and they were compensated with EUR 50 after finishing the sessions. More details of the driving test procedure are described in a previous publication [2].

Ground Truth Definition for Driver Drowsiness
To monitor the participants' driving behavior, four cameras were placed in the ADSG that recorded different views of the driver and the test track (see Figure 3). Traffic psychologists thoroughly observed these videos and assigned labels to the driver's drowsiness level based on drowsiness signs such as yawning, long blinks, and head nodding. The driver's vigilance state is reported in four classes: alert (AL), moderately drowsy (MD), extremely drowsy (ED), and falling asleep event (SL). These drowsiness levels are collected with their corresponding SmartEye™ video frame numbers to synchronize drowsiness level ratings with the recorded data channels (more details of data synchronization are explained in Section 3.1). Figure 4 shows an example of the defined ground truth for driver drowsiness in all four driving tests (all performed by the same driver). As this figure shows, micro-sleep events (SL) were also reported by video observers. However, we merged the SL class with the extremely drowsy (ED) class since the overall number of SL samples was too small to be considered as a separate class for machine learning training. This figure also shows that even in the rested condition, some drivers showed signs of moderately and extremely drowsy states. More details of the ground truth definition for driver drowsiness using video observations are explained in our previous publication [30].
Figure 3. Four different views of the driver and the test track. These views were observed thoroughly by an expert to define a ground truth for driver drowsiness based on drowsiness signs into three classes (informed consent was obtained from the driver to publish his image in this paper; reprinted from our previous study [2], license no. 5218171384545).

Methodology
Two different methodologies are employed to classify driver drowsiness using ECG signals: (1) two traditional classifiers (random forest and KNN) trained by features extracted from HRV signals, and (2) one deep convolutional neural network (CNN) model trained by ECG wavelet scalogram images. The Bayesian optimization method is used to optimize the hyperparameters of the classifiers. Figure 5 shows the flowchart of these methods, and the following subsections describe their structure.
Figure 5. Two different approaches to classify driver drowsiness using ECG signals: wavelet scalograms or derived HRV features. The hyperparameters of the KNN, random forest, and CNN models are optimized using the Bayesian optimization method.

Data Synchronization
Ground truth is defined based on video observation and recorded using the frame-number information of the SmartEye™ data collected at a sampling frequency of 100 Hz. Physiological signals were recorded with separate equipment at a 500 Hz sampling frequency but were also fed into the central recording module and stored at 100 Hz. The lower sample rate is not sufficient for high-quality processing of physiological data. Therefore, we synchronized the video and physiological data sources with the help of the respiration signal, which is available at both sampling rates (100 Hz and 500 Hz). The normalized cross-correlation between the two respiration signals is calculated at all possible lags, and the delay between them is taken as the lag with the largest absolute value of the normalized cross-correlation. Figure 6 shows an example of data synchronization where the 500 Hz respiration data is shifted about 21.4 s forward to synchronize it with the 100 Hz respiration data. The same time shift is applied to the ECG signals collected at 500 Hz to synchronize them with the video observations. For these data, offset correction was sufficient for an accuracy of ±1 video frame.
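The lag-estimation step can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name and signal names are hypothetical, and both streams are assumed to have been brought to a common sampling rate first:

```python
import numpy as np

def estimate_delay(ref, sig, fs):
    """Estimate the delay of `sig` relative to `ref` (both sampled at fs)
    as the lag maximizing the absolute normalized cross-correlation."""
    ref = (ref - ref.mean()) / ref.std()
    sig = (sig - sig.mean()) / sig.std()
    corr = np.correlate(sig, ref, mode="full") / len(ref)
    lags = np.arange(-(len(ref) - 1), len(sig))
    return lags[np.argmax(np.abs(corr))] / fs  # delay in seconds
```

The estimated delay (21.4 s in the example of Figure 6) is then applied as a constant offset to the 500 Hz channels.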

ECG Preprocessing
Generally, ECG signals are contaminated by different noise sources such as power line interference (50 Hz) [31] and baseline wander [32]. A second-order infinite impulse response (IIR) notch filter [33] is utilized here to remove the power line noise from the ECG signals. Furthermore, a high-pass filter with a passband frequency of 0.5 Hz is employed to remove the low-frequency baseline wander. Figure 7 shows one part of the noisy and denoised ECG signal after removing the baseline wander and power line noise.
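A minimal sketch of this two-stage filtering with SciPy follows; the high-pass filter order and the notch quality factor are assumptions, not values reported by the authors:

```python
import numpy as np
from scipy.signal import iirnotch, butter, filtfilt

def denoise_ecg(ecg, fs):
    """Remove 50 Hz power-line interference with a second-order IIR notch
    filter, then baseline wander with a 0.5 Hz high-pass filter."""
    b_notch, a_notch = iirnotch(w0=50.0, Q=30.0, fs=fs)  # Q is an assumption
    ecg = filtfilt(b_notch, a_notch, ecg)
    b_hp, a_hp = butter(N=2, Wn=0.5, btype="highpass", fs=fs)
    return filtfilt(b_hp, a_hp, ecg)
```

`filtfilt` applies each filter forward and backward, which avoids phase distortion of the ECG waveform shape.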

Driver Drowsiness Classification Using Scalograms of ECG Signals
This section describes the proposed method for driver drowsiness classification using deep neural networks trained by wavelet scalogram images of the ECG signals.

Wavelet Scalogram
Wavelet analysis calculates the correlation (similarity) between an input signal and a given wavelet function ψ(t). Unlike the Fourier transform, wavelet analysis provides a multiresolution time-frequency output, under the assumption that low frequencies maintain the same characteristics over the whole duration of the input signal, whereas high frequencies appear at different time points as short events. Therefore, the wavelet function is scaled and translated by two parameters s ∈ R+ and u ∈ R to generate a wavelet filter bank ψ_{u,s}, as presented in Equation (1) [34]:

ψ_{u,s}(t) = (1/√s) ψ((t − u)/s).    (1)
By using this transformed wavelet, the continuous wavelet transform (CWT) of the input signal x(t) at time u and scale s can be calculated as

X_WT(u, s) = ∫ x(t) (1/√s) ψ*((t − u)/s) dt,    (2)

where ψ*(t) is the complex conjugate of ψ(t) and X_WT(u, s) provides the frequency content of x(t) corresponding to the time u and scale s. By using the two parameters u and s, it is possible to investigate the input x(t) in the time and frequency domains simultaneously, whereby the resolution of time and frequency depends on the scale parameter s. The CWT thus provides a decomposition of x(t) in the time-frequency plane. This method can be more beneficial than methods such as the short-time Fourier transform (STFT) when investigating non-stationary signals, since it provides a higher time resolution at higher frequencies by reducing frequency resolution, and a higher frequency resolution at lower frequencies by reducing time resolution. In contrast, the time and frequency resolutions of the STFT are constant. The scalogram of x(t) at any positive scale is calculated as the norm of X_WT(u, s):

‖X_WT(u, s)‖ = ( ∫ |X_WT(u, s)|² du )^(1/2).    (3)

This equation calculates the energy of X_WT at a scale s. Therefore, we can find the significant scales (which correspond to frequencies) in the signal using the scalogram.
The wavelet scalogram is used to transform the time-series ECG signal into the time-frequency domain. Here, the Morse wavelet [35] is employed to calculate the wavelet transformation of the ECG signals. A sliding window with a length of 10 s and an overlap of 5 s between two consecutive windows was employed, and the scalogram image of every window of the data was calculated. The resulting numbers of data windows in each level of driver drowsiness are provided in Table 2 for manual driving and in Table 3 for automated driving. As these tables show, the generated data sets are imbalanced in both manual and automated modes; this fact must be taken into account in the structure of the deep and traditional classifiers. Moreover, the percentages of the MD and ED classes are higher in the automated driving tests than in the manual tests, so the drivers were generally drowsier in the automated mode. Figure 8 shows sample images of ECG signals and their corresponding scalogram images for all three drowsiness levels in a rested automated test. The generated images are resized to 224 × 224 pixels. To reduce the computational complexity of training the deep network, the RGB scalogram images are also transformed to grayscale images, as presented in Figure 9. These grayscale images are used as input data to train the deep convolutional neural network.
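The scalogram computation can be sketched with a plain-NumPy CWT. The paper uses the analytic Morse wavelet; a real Morlet wavelet is used below as a readily reproducible stand-in, so this illustrates the idea rather than the exact transform:

```python
import numpy as np

def scalogram(x, scales):
    """|X_WT(u, s)| for a real Morlet wavelet psi(t) = exp(-t^2/2)cos(5t).
    `scales` are in samples; rows of the output correspond to scales,
    columns to time shifts u."""
    out = np.empty((len(scales), len(x)))
    for i, s in enumerate(scales):
        k = np.arange(-4 * s, 4 * s + 1) / s            # wavelet support
        psi = np.exp(-k**2 / 2) * np.cos(5 * k) / np.sqrt(s)
        out[i] = np.abs(np.convolve(x, psi[::-1], mode="same"))
    return out
```

In the paper, each 10 s ECG window is transformed in this manner, rendered as an image, resized to 224 × 224 pixels, and converted to grayscale.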

Architecture of Deep CNN and Optimization of its Hyperparameters
Convolutional neural networks (CNNs) have been widely used to learn features from input images in different applications [36][37][38][39]. These networks capture the spatial dependencies in different parts of an input image by applying convolution operations with specific filters to the input images [40]. This study used scalogram images of ECG signals to train a deep CNN and classify driver drowsiness. As scalograms are time-frequency representations of an underlying time signal, temporal information is coded in the spatial features of the image. The input images are first normalized to have zero mean and unit variance. Then, the whole data set is split randomly into training (80% of the input data), validation (10%), and test (10%) subsets in such a way that the percentages of the drowsiness classes in each subset are approximately the same as presented in Tables 2 and 3.
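Such a stratified split can be sketched as follows (a sketch only; the authors do not publish their splitting code, and only the 80/10/10 fractions come from the text):

```python
import numpy as np

def stratified_split(labels, fracs=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/validation/test subsets so that
    class proportions are approximately preserved in each subset."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr = int(round(fracs[0] * len(idx)))
        n_va = int(round(fracs[1] * len(idx)))
        train.extend(idx[:n_tr])
        val.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return np.array(train), np.array(val), np.array(test)
```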
The utilized deep CNN is composed of five convolutional blocks and one fully connected block in its hidden layers. The convolution and fully connected blocks are presented in Figure 10, where Conv, BN, ReLU, Max Pool, and FC denote a convolution layer, a batch normalization layer, the ReLU activation function (max(x, 0)), a max pooling layer, and a fully connected layer, respectively. The hidden layers are followed by the output layer, which is constructed from an FC layer, a soft-max layer, and a weighted classification layer (weight). The number of neurons in the FC layer of the output layer is equal to the number of classes (here, three). The weight layer is used to mitigate the data imbalance issue. This layer contains one element per drowsiness class, where every element is calculated using Equation (4):

W_i = (∑_{j=1}^{N_c} C_j) / (N_c · C_i),    (4)

where N_c is the number of classes (here, three), C_i is the number of data samples that belong to the i-th class, and W_i is the weight of the i-th class. By applying Equation (4) to the data samples belonging to the drowsiness classes in the manual and automated modes (presented in Tables 2 and 3), the corresponding weights for every class are computed; Table 4 provides these weights. As this table shows, the class weights of the ED class are the highest in both manual and automated mode tests. With these weights, misclassification errors of the MD and ED classes receive more weight than those of the AL class. Therefore, if the network wrongly classifies an MD or ED sample into the AL class, the result is a large misclassification error that has a significant influence on the optimization process and thus reduces the frequency of this kind of classification error. Figure 11 presents the architecture of the deep CNN, where five convolution blocks are followed by one fully connected block. Moreover, one dropout layer is added after the convolution blocks to reduce the possibility of overfitting or of getting stuck in local minima during the training process.
The dropout layer temporarily eliminates some neurons with a predefined probability, along with all of their input and output connections [41].
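Assuming the common inverse-frequency weighting that matches the description of Equation (4) (the exact normalization constant is an assumption, and the class counts below are hypothetical), the class weights can be computed as:

```python
import numpy as np

def class_weights(counts):
    """W_i = sum_j(C_j) / (N_c * C_i): the rarest class gets the largest
    weight, so its misclassification contributes more to the loss."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)
```

A useful property of this formulation is that the weighted class sizes are balanced: the sum of W_i · C_i over all classes equals the total sample count.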
Deep neural networks have several hyperparameters, such as the learning rate, the regularization parameter, and the number of neurons or filters, that can influence network performance. Finding a combination of these hyperparameters that provides the optimal performance of the deep network is an active task in the research field of deep learning [42,43]. In this study, the Bayesian optimization method [44,45] is applied to optimize the hyperparameters of the deep CNN. This method can reason about an iteration's expected performance before it is performed; therefore, fewer iterations are needed to find the optimal hyperparameter combination than with other hyperparameter optimization methods [45]. Moreover, this method yields better generalization on the test data set [46]. Here, four hyperparameters were optimized using the Bayesian optimization method: the learning rate, the L2 regularization, the dropout probability, and the number of filters in the convolution layers (Conv1 to Conv5 in Figure 11) together with the number of neurons in the fully connected layer of the hidden layers (FC1 in Figure 11). It is assumed that the number of filters in Conv1 to Conv5 and the number of neurons in FC1 are equal, so only one hyperparameter is defined to find their optimal value. Table 5 presents the specified search space for each of these hyperparameters.
Figure 11. The architecture of the deep CNN used to classify driver drowsiness using ECG scalogram images.
An adaptive moment estimation (ADAM) optimizer is employed to train the parameters of the designed deep CNN (weights and biases). The maximum number of epochs is empirically chosen to be 15. A learning rate schedule is utilized that multiplies the initial learning rate by 0.1 after 12 epochs to alleviate overfitting in the later training epochs. The mini-batch size is constant and equal to 16.
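The learning-rate schedule described above can be written as a small function; the initial value below is a placeholder, since the actual value was found by Bayesian optimization:

```python
def learning_rate(epoch, initial_lr=1e-3):
    """Piecewise-constant schedule: multiply the initial learning rate
    by 0.1 after 12 epochs (epochs are 1-indexed; initial_lr is a
    hypothetical value)."""
    return initial_lr * (0.1 if epoch > 12 else 1.0)
```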
The training process was conducted on a system with an Intel Core™ i7-782HQ CPU and an NVIDIA™ Quadro M2200 GPU.

Driver Drowsiness Classification Using Heart Rate Variability Data
This section describes the proposed method for driver drowsiness classification using feature extraction from HRV data.

Derivation of Heart Rate Variability Data from ECG Signals
The heart rate variability signal is derived from the preprocessed ECG signals by applying an R-peak detection algorithm. In this study, we used the automatic multiscale-based peak detection (AMPD) method [47] as the ECG R-peak detector; then, the RR intervals (RRIs), defined as the time intervals between every two consecutive R-peaks, are calculated. Figure 12 shows the detected R-peaks in a part of the ECG signal.
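The step from detected R-peaks to RR intervals can be sketched as follows; the toy threshold/local-maximum detector below merely stands in for the AMPD algorithm used in the paper:

```python
import numpy as np

def rr_intervals(ecg, fs, thresh=0.6):
    """Detect R-peaks as local maxima above a threshold (a toy stand-in
    for the AMPD detector) and return the RR intervals in seconds."""
    peaks = [i for i in range(1, len(ecg) - 1)
             if ecg[i] > thresh and ecg[i] >= ecg[i - 1] and ecg[i] > ecg[i + 1]]
    return np.diff(peaks) / fs
```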

Feature Extraction from HRV Data
The literature proposes several features to extract from RR intervals for driver drowsiness detection [18], which conform to measures well established in clinical contexts [48]. Other HRV features are based on a visualization technique called the Poincaré plot. In this subsection, this plot is first introduced; then, the Poincaré-based and other commonly extracted RR-interval features are explained.
Poincaré plot: This plot is a type of recurrence plot for investigating self-similarity in time series that can be used to analyze the nonlinear properties of HRV data [49]. Consider X = [RR_1, RR_2, . . . , RR_N] as an RR interval time series. The Poincaré plot plots each interval against the next one: first (RR_t, RR_{t+1}), then (RR_{t+1}, RR_{t+2}), and so on. This plot provides information about the short-term and long-term dynamics of the RR intervals. An ellipse is fitted to the plotted data points, and the minor and major semi-axes of the ellipse are associated with short-term and long-term HRV, respectively. Figure 13 shows the Poincaré plot for RR intervals collected in a rested automated driving test. The least-squares method was employed to fit an ellipse to the given RR intervals [50], and geometrical properties of this ellipse are extracted as features to describe the HRV dynamics. Three features are extracted from this plot:

1. SD1: SD1 is the standard deviation of the Poincaré plot perpendicular to the line of identity and the semi-minor axis (half of the shortest diameter) of the fitted ellipse; see Figure 13 (green vector). SD1 is an estimate of short-term HRV that describes parasympathetic activity, since it represents the deviation of the heart rate from the line of identity (constant heart rate).
2. SD2: SD2 is the standard deviation of the Poincaré plot along the line of identity and the semi-major axis (half of the largest diameter) of the fitted ellipse; see Figure 13 (red vector). SD2 is an estimate of long-term HRV that describes sympathetic and mixed activity, since SD2 lies along the line of identity.
3. SD1/SD2: SD1/SD2 is the ratio of SD1 to SD2 that describes the ratio of short-term to long-term HRV and the relationship between parasympathetic and sympathetic activity.
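SD1 and SD2 can also be computed without an explicit ellipse fit, as the dispersion of the Poincaré points perpendicular to and along the line of identity (the standard rotated-coordinate formulation, used here as a shortcut for the least-squares ellipse fit described above):

```python
import numpy as np

def poincare_sd(rr):
    """SD1/SD2 of the Poincare plot of successive RR intervals: standard
    deviations of the points (RR_t, RR_{t+1}) perpendicular to and along
    the line of identity."""
    x, y = np.asarray(rr[:-1]), np.asarray(rr[1:])
    sd1 = np.std((y - x) / np.sqrt(2))  # across the identity line (short-term)
    sd2 = np.std((y + x) / np.sqrt(2))  # along the identity line (long-term)
    return sd1, sd2
```

The third feature follows directly as the ratio sd1 / sd2.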
Other features that have been proposed by previous studies [18,51,52] are also extracted from RR intervals. These features are:

1. MeanRR: This feature presents the mean value of the time intervals between every two consecutive R-peaks, calculated by Equation (5):

MeanRR = (1/(N_R − 1)) ∑_{i=1}^{N_R−1} RR_{i+1},    (5)

where N_R is the number of heartbeats in the sliding window and RR_{i+1} is equal to the time interval between R_i and R_{i+1}.
2. SDRR: This feature represents the standard deviation of the RR intervals, calculated by Equation (6):

SDRR = √( (1/(N_R − 1)) ∑_{i=1}^{N_R−1} (RR_{i+1} − MeanRR)² ).    (6)

3. RMSSD: This feature is the root mean square of the differences of consecutive RR intervals, calculated by Equation (7), and reflects parasympathetic activity:

RMSSD = √( (1/(N_R − 2)) ∑_{i=1}^{N_R−2} (RR_{i+2} − RR_{i+1})² ).    (7)

4. pRR50: This feature measures the ratio of the number of R-peaks that differ by more than 50 ms from their next R-peak to the total number of RR intervals in every sliding window, as in Equation (8):

pRR50 = |{i : |RR_{i+2} − RR_{i+1}| > 50 ms}| / (N_R − 1) × 100%.    (8)

5. VLF: This feature presents the power in the very-low-frequency range (0.003-0.04 Hz) of the RR interval time series. To calculate this feature as well as LF and HF, the PSD of the RR intervals is computed using the Lomb-Scargle periodogram method [53,54] in every sliding window.
6. LF: This feature presents the power in the low-frequency range (0.04-0.15 Hz) of the RR interval time series.
7. HF: This feature presents the power in the high-frequency range (0.15-0.40 Hz) of the RR interval time series and reflects parasympathetic activity.
8. LF/HF: This feature is the ratio of LF to HF and is indicative of the sympathetic-parasympathetic balance.
Overall, eleven features are extracted from the HRV data and used to classify the driver's drowsiness. A window length of ten seconds, as used in the ECG scalogram approach above, is considered extremely short for the evaluation of HRV, as it captures nearly exclusively the fast fluctuations of heart rate attributable to parasympathetic activity [55]. For exploratory purposes, we therefore also applied longer sliding windows than in the deep learning model. Two additional sliding windows are employed: (1) 60 s with 30 s overlap, and (2) 40 s with 20 s overlap. These longer windows provide a better estimation of the mid-range dynamics of the HRV data than can be expected from the short windows used in the deep learning method. The results of these sliding windows are compared in Section 4.1.
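The time-domain part of this feature set can be sketched as follows (a sketch only; pRR50 here uses the number of RR intervals in the window as the denominator, per the description above):

```python
import numpy as np

def hrv_time_features(rr):
    """MeanRR, SDRR, RMSSD, and pRR50 from a window of RR intervals
    given in seconds."""
    rr = np.asarray(rr, dtype=float)
    d = np.diff(rr)  # successive RR-interval differences
    return {
        "MeanRR": rr.mean(),
        "SDRR": rr.std(ddof=1),
        "RMSSD": np.sqrt(np.mean(d**2)),
        "pRR50": 100.0 * np.sum(np.abs(d) > 0.050) / len(rr),
    }
```

The frequency-domain features (VLF, LF, HF) would additionally require the Lomb-Scargle periodogram of the unevenly sampled RR series, which is omitted here.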
The following subsection explains the two classifiers (KNN and random forest) used for drowsiness classification.

Classifying Driver Drowsiness Using Traditional Classifiers
The KNN and random forest classifiers are employed to classify driver drowsiness using the features extracted from the HRV data. Each of these classifiers has two hyperparameters. The KNN hyperparameters are the number of neighbors for every sample (numNei) and the function used to measure the distance between samples (distance) [56]. The random forest hyperparameters are the minimum leaf size (minLS) and the number of predictors to sample at each node (numPTS) [57]. These hyperparameters are also optimized using the Bayesian optimization method. Moreover, the imbalanced data set issue is addressed by using a uniform prior probability for every class for the KNN [58] and random under-sampling boosting (RUSBoost) [59] for the random forest classifier.

Results and Discussion
To evaluate the generated classifiers, confusion matrices were calculated for the test dataset. These matrices provide four different values that are computed for every drowsiness level:

1.
True-negative (TN): The number of samples that do not belong to a specific class (for example, AL) and are also not assigned to that class by the classifier, i.e., they are classified in one of the two other classes (MD or ED).

2.
True-positive (TP): The number of samples that belong to a specific class (for example, AL) and are correctly classified in that class.

3.
False-negative (FN): The number of samples that belong to a specific class (for example, AL) but are wrongly classified in any of the two other classes (MD or ED).

4.
False-positive (FP): The number of samples that do not belong to the specific class (for example, AL) and are wrongly classified in that class.
These four values are used to calculate five different metrics for every level of driver drowsiness:

1.
Specificity (true negative rate): The specificity is TN divided by the sum of TN and FP. It can be interpreted as the probability of a sample not being classified in a class if it does not belong there.

2.
Sensitivity (true positive rate): The sensitivity is TP divided by the sum of TP and FN.

3.
Precision (positive predictive value): The precision is TP divided by the sum of TP and FP.

4.
F1-score: The F1-score is the harmonic mean of precision and sensitivity.

5.
Balanced accuracy: The balanced accuracy is equal to the average of the accuracies of the three classes. The accuracy of every class is equal to the ratio of the TP of that class to the number of samples that belong to it according to the actual labels.
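The four counts and five metrics above can be computed directly from a confusion matrix, as in the following minimal sketch. The 3x3 toy matrix is arbitrary and illustrative; it does not reproduce the numbers of Figures 14-17.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class TP/FP/FN/TN and the five metrics from a confusion matrix.
    cm[i, j] = number of samples with true class i predicted as class j."""
    total = cm.sum()
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as the class, but not in it
    fn = cm.sum(axis=1) - tp           # in the class, but predicted elsewhere
    tn = total - tp - fp - fn
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)       # also the per-class accuracy here
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    balanced_accuracy = sensitivity.mean()
    return specificity, sensitivity, precision, f1, balanced_accuracy

# Toy 3x3 matrix for the AL / MD / ED classes
cm = np.array([[50, 8, 2],
               [10, 30, 5],
               [3, 4, 20]])
spec, sens, prec, f1, bal_acc = per_class_metrics(cm)
```

For this toy matrix, the AL sensitivity is 50/60 and the balanced accuracy is the mean of the three diagonal class accuracies, matching the definition used in Tables 6 and 10.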
The following subsections present the results of the two proposed methods for driver drowsiness classification.

Results of Driver Drowsiness Classification Using Heart Rate Variability Data
To evaluate the performance of the KNN and random forest classifiers, confusion matrices of these classifiers trained by HRV-based features with the three sliding windows of 10 s, 40 s, and 60 s are provided in Figures 14-16, respectively. In these figures, the diagonal elements (in gray) give the number of sliding windows that are correctly classified into the different drowsiness classes according to the ground-truth classification from the video observations. Accordingly, the percentages written in these elements show the correct classification accuracy for every drowsiness level. Non-diagonal cells present the number of misclassified samples. As these figures show, the classification accuracy of the MD class is lower than that of the other two classes in the manual mode. Furthermore, random forest performs better than KNN for drowsiness classification in both manual and automated modes, regardless of the sliding window used. Balanced accuracies for every classifier in every driving mode are provided in Table 6. These accuracies are calculated as the average of the TP accuracies in the confusion matrices, shown in the gray elements of Figure 16c,d. The best balanced accuracy achieved using traditional methods in the manual mode is therefore the average of 64.7%, 56.2%, and 56.4%, and the best balanced accuracy achieved by the same methods in the automated mode is the average of 63.4%, 63.5%, and 66.6%. According to this table, the best balanced accuracies in the automated and manual modes are 63.8% and 62.1%, respectively, obtained using the random forest classifier with the sliding window of 60 s and 30 s overlap. For the sake of brevity, the classification metrics, including specificity, sensitivity, precision, and F1-score, are shown only for the best classifier (random forest trained by HRV-based features with a 60 s sliding window) in the manual and automated modes. Table 7 presents these metrics.
As this table shows, the precision value for the ED class is low because the number of TP samples for this class is low. According to this table, the AL class has the maximum F1-score in both manual and automated modes. The accuracy of the random forest for the AL class is therefore higher than for the other two classes, and accordingly, false alarms for this class are reduced with this classifier.

Results of Driver Drowsiness Classification Using Scalogram of ECG Signals
As presented in Section 3.3.2, four hyperparameters of the deep CNN are optimized using the Bayesian optimizer. Table 8 shows the optimized hyperparameters of the deep CNNs in the manual and automated driving modes. As this table shows, the number of filters in the convolution layers and of neurons in the fully connected layer (represented by the hyperparameter H4) is higher in the automated driving mode. Therefore, the computational cost of classifying driver drowsiness with the proposed deep CNNs is higher for the automated tests. The L2 regularization value (represented by the hyperparameter H3) is much higher in the manual tests than in the automated tests; thus, stronger weight regularization is needed for the deep CNN that classifies driver drowsiness in the manual tests. The dropout probability (represented by the hyperparameter H2) of the deep CNN trained on the ECG signals of the automated tests is higher than that of the deep CNN designed for the manual tests. The number of neurons is also higher for the deep CNN trained on the ECG signals of the automated tests, so this network is wider than the one for the manual tests. Consequently, its dropout probability should also be higher to turn off more neurons and avoid overfitting. Compared with other widely used deep CNNs implemented in embedded systems for real-time face recognition or object detection, our deep CNNs have far fewer parameters. Table 9 compares the number of parameters of four deep networks frequently used in real-time applications (AlexNet [60], VGG16 [61], ResNet18 [62], and GoogLeNet [63]) with our networks. Confusion matrices of the deep CNNs trained on the ECG signals of the manual and automated tests are provided in Figure 17 to evaluate their classification performance.
As this figure shows, the MD and AL classes have the lowest and highest classification accuracy, respectively, in both manual and automated driving modes. Therefore, reducing the number of classes from three to two could increase classification accuracy. However, binary classification would not capture the transition between the AL and ED states. The balanced accuracy of the deep CNNs in both manual and automated modes is also provided in Table 10. These accuracies are calculated as the average of the TP accuracies in the confusion matrices, shown in the gray elements of Figure 17a,b. Therefore, the balanced accuracy in the manual mode is the average of 81.2%, 78.6%, and 79.1%, and the balanced accuracy in the automated mode is the average of 82.2%, 73.8%, and 82.0%. According to Table 10, the balanced accuracies of the deep CNN in the manual and automated driving modes are about 77% and 79%, respectively. Comparing Table 10 with Table 6 shows that the deep CNNs significantly outperform the random forest and KNN methods in both manual and automated modes. Therefore, the input ECG scalograms are more informative than HRV-based features with respect to driver drowsiness levels.
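The scalogram images used as CNN input can be illustrated with a minimal continuous wavelet transform of a 10 s window. The complex Morlet family, the frequency grid, the normalization, and the synthetic "ECG-like" test signal below are illustrative assumptions, not the exact wavelet settings of this study.

```python
import numpy as np

def morlet_scalogram(x, fs, freqs, w0=6.0):
    """|CWT| of signal x using complex Morlet wavelets, one row per frequency.
    Returns a 2-D time-frequency image suitable as CNN input."""
    n = len(x)
    out = np.empty((len(freqs), n))
    for i, f in enumerate(freqs):
        scale = w0 * fs / (2 * np.pi * f)      # scale for centre frequency f
        m = int(min(10 * scale, n))            # wavelet support in samples
        t = (np.arange(m) - m // 2) / scale
        wavelet = np.exp(1j * w0 * t) * np.exp(-t**2 / 2)
        wavelet /= np.sqrt(scale)              # per-scale normalisation
        out[i] = np.abs(np.convolve(x, np.conj(wavelet), mode="same"))
    return out

# 10 s "ECG-like" test signal sampled at 250 Hz: a 1.2 Hz beat-rate
# component plus a weaker 10 Hz component
fs = 250
t = np.arange(0, 10, 1 / fs)
x = np.sin(2 * np.pi * 1.2 * t) + 0.5 * np.sin(2 * np.pi * 10 * t)
freqs = np.linspace(0.5, 40, 64)
scal = morlet_scalogram(x, fs, freqs)          # 64 x 2500 scalogram image
```

In the resulting image, energy concentrates in the rows near the signal's component frequencies, which is the time-frequency structure the deep CNN learns from.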
Classification metrics of the deep CNNs in the manual and automated modes are provided in Table 11. Comparing Table 11 with Table 7 shows that the F1-scores of all classes in both driving modes, except the AL class in the manual mode, are improved by the deep CNN method. According to this table, the precision value for the ED class is also lower than that of the other classes, since the number of data samples in this class is much lower than in the MD and AL classes.
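The Bayesian optimization used for hyperparameter tuning, both for the classifier hyperparameters and for the deep CNN hyperparameters, can be sketched as a generic surrogate-model loop. The Gaussian-process surrogate, Matern kernel, expected-improvement acquisition, and the toy one-dimensional objective below are illustrative assumptions, not the exact optimizer configuration of this study.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Stand-in validation loss over one normalized hyperparameter."""
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(3, 1))            # initial random evaluations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
grid = np.linspace(0, 1, 200).reshape(-1, 1)  # candidate hyperparameter values

for _ in range(12):
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.min()
    # Expected improvement for minimization
    z = (best - mu) / np.maximum(sigma, 1e-12)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    ei = np.where(sigma > 1e-12, ei, 0.0)
    x_next = grid[np.argmax(ei)]              # most promising point to try next
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

x_best = float(X[np.argmin(y), 0])
```

The loop alternates between fitting the surrogate to all evaluated hyperparameter settings and evaluating the setting with the highest expected improvement, which is what makes Bayesian optimization far more sample-efficient than grid search when each evaluation means training a network.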

Conclusions
Two different methodologies were proposed in this paper to classify driver drowsiness using ECG signals. In the first methodology, R-peaks are first detected from the ECG signals to obtain the HRV data. Then, eleven features are extracted from the HRV data, and finally, random forest and KNN classifiers are used to classify drowsiness into three classes: alert, moderately drowsy, and extremely drowsy. In the second method, a deep CNN was used to classify drowsiness into the same classes, with wavelet scalogram images of the ECG signals as inputs to the network. Results showed that classification with the deep CNN on ECG scalograms was more accurate than the random forest and KNN classifiers on HRV in both manual and automated driving modes. It is noteworthy that the length of the ECG signals for the scalograms was only 10 s. For direct comparison, we also calculated HRV features on 10 s windows, though we are aware that this time frame captures fast, mostly respiratory, fluctuations only. Time frames of at least 1-2 min or even longer are necessary for good agreement with usual short-term HRV measures [55,64]. We also computed longer time frames of 40 s and 60 s to verify the hypothesis that these longer windows capture more relevant information. Indeed, the classification accuracy of both the KNN and RF classifiers increases with the duration of the time window used for HRV calculation.
In contrast, the deep CNN on ECG scalograms already performs better based on 10 s windows alone. We conclude that the time-frequency content of the entire ECG signal captures information about the autonomic state of an individual beyond the RRI signal, which is the only information used for classical HRV parameters. Further research is suggested to identify which feature of the ECG exactly codes the relevant information.
The following tasks are also suggested to improve the designed driver drowsiness classification system:

1.
Applying a quality assessment method to the ECG signals in every sliding window can help to remove noisy data and increase classification accuracy. Moreover, a quality index can be derived for each sliding window to specify its influence on the reported drowsiness level in a predefined time interval (e.g., 1 min).

2.
In this study, ECG signals are collected using electrodes attached to the driver's chest. Non-invasive sensors such as smart watches could also be used to design a non-disturbing system for drivers. However, the accuracy of these devices should be compared with accurate medical sensors, and differences in information gain due to the different characteristics of an ECG versus an optical pulse signal need to be evaluated.

3.
The proposed methods in this study developed generic driver drowsiness classification systems that consider no driver-specific differences. Only two hours of data are available for every driver, which might not be sufficient to train a driver-specific deep network. To build a driver-specific system, transfer learning [65] can be employed: the trained deep CNNs can be fine-tuned for a specific driver using a shuffled portion of their ECG data as the training set. The fine-tuned deep CNN can then be used to evaluate drowsiness for that driver on the unseen test set or in real time. This approach can also reduce the amount of data needed from each driver to build a driver-specific system.

4.
In this study, data from signal segments were treated as independent from each other by random selection, and the sequential time information was ignored. Though this is presumably an advantage for the ability of a practical system to react fast, the transition from wakefulness to drowsiness might also be considered a continuous, slower process. Therefore, it should be evaluated whether the outcomes of the deep network profit from the inclusion of sequential information from the training segments.