Vision-Based Driver’s Cognitive Load Classification Considering Eye Movement Using Machine Learning and Deep Learning

Due to the advancement of science and technology, modern cars are highly technical, more activity occurs inside the car and driving is faster; however, statistics show that the number of road fatalities has increased in recent years because of drivers' unsafe behaviors. Therefore, to make the traffic environment safe, it is important to keep the driver alert and awake, both in human-driven and autonomous cars. A driver's cognitive load is considered a good indicator of alertness, but determining cognitive load is challenging, and wired sensor solutions are not well accepted in real-world driving scenarios. The recent development of non-contact approaches through image processing, together with decreasing hardware prices, enables new solutions, and several interesting features related to the driver's eyes are currently being explored in research. This paper presents a vision-based method that extracts useful parameters from a driver's eye movement signals, using both manual feature extraction based on domain knowledge and automatic feature extraction with deep learning architectures. Five machine learning models and three deep learning architectures are developed to classify a driver's cognitive load. The results show that the highest classification accuracy achieved is 92% by the support vector machine model with a linear kernel function and 91% by the convolutional neural network model. This non-contact technology can be a potential contributor to advanced driver assistance systems.


Introduction
Today's vehicle systems are more advanced, faster and safer than before and are in the process of becoming fully autonomous. The literature shows that most traffic accidents are caused by human error [1]. Therefore, theoretically, a well-programmed computer system or autonomous system can reduce the accident rate [2]. Recently, many automobile manufacturers have launched cars with autonomy levels 3 and 4; however, in the development process of autonomous vehicles, human drivers must be present in case the autonomous system fails or, if necessary, to provide assistance [3]. Hence, the need for driver monitoring is rapidly increasing in the transportation research community as well as in the vehicle industry.
According to the National Highway Traffic Safety Administration (NHTSA), about 94% of all observed accidents that occurred in 2018 were due to human error [4], such as higher stress [5], tiredness [6], drowsiness [7,8] or higher cognitive load [9]. A report published in 2015 shows that almost 38% of all road accidents happen due to the driver's mental distraction [10], which increases the cognitive load of the driver. Another driver state, fatigue, is the gradually increasing subjective feeling of tiredness of a subject under load. Fatigue can have physical or mental causes and can be manifested in a number of different ways [11]. Regarding eye movement analysis, previous studies have used different sampling and windowing strategies. In one study, each task was performed for 3 min and features were extracted from the entire 3 min of data, i.e., the sampling frequency was 180 Hz. In [47], a sampling frequency of 15 Hz was considered for calculating the eye movement parameter fixation duration. The authors of [48] considered different window sizes, but the best performance appeared when the time window size was 30 s. However, a different opinion is seen in [49]; the author suggested that it might be unnecessary to limit the window size. After all, the sampling frequency for eye movement feature extraction depends on the characteristics of the data, such as the number and duration of the secondary tasks.


Data Collection
Thirty-three male participants aged between 35 and 50 years (42.47 ± 4.39) were recruited for the study. Only males were chosen to obtain a homogeneous group from the population. The regional ethics committee at Linköping University, Sweden (Dnr 2014/309-31) approved the study and each participant signed an informed consent form. The experiment was carried out using a car simulator (VTI Driving Simulator III) (https://www.vti.se/en/research/vehicle-technology-and-driving-simulation/driving-simulation/simulator-facilities) (Accessed date: 26 November 2021), which is shown in Figure 1. The approximate driving time was 40 min, including a practice session of 10 min before the actual driving with the cognitive load activity. The driving simulation environment consisted of three recurring scenarios: (1) a four-way crossing with an oncoming bus and a car approaching the crossing from the right (CR), (2) a hidden exit on the right side of the road with a warning sign (HE), and (3) a strong side wind in open terrain (SW). In the simulation, the road was a rural road with one lane in each direction, some curves and slopes and a speed limit of 80 km/h. As a within-measures study, each scenario was repeated four times during the driving session, where participants either were engaged in a cognitive load task, i.e., a one-back task, or were driving through a scenario without an additional task (baseline). Thus, the cognitive load was annotated as class '0' for the baseline and class '1' for the one-back task. The start and end times of each HE, CR and SW were recorded with a no-task or one-back-task marker. The one-back task was imposed on the drivers by presenting a number aurally every two seconds. The participants had to respond by pressing a button mounted on their right index finger against the steering wheel if the same number was presented twice in a row.
The scenarios were designed to investigate the adaptation of driver behavior corresponding to the scenario and the cognitive task level (i.e., the one-back task).
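The one-back response rule can be sketched as a short function (a minimal illustration, not the simulator's actual software; the stimulus stream is assumed to be a list of the aurally presented numbers):

```python
def one_back_targets(stimuli):
    """Return the indices at which a button press is expected,
    i.e., where the same number is presented twice in a row."""
    return [i for i in range(1, len(stimuli)) if stimuli[i] == stimuli[i - 1]]

# Example: numbers presented every two seconds
presses = one_back_targets([3, 7, 7, 2, 5, 5, 5, 1])
# -> presses expected at indices 2, 5 and 6
```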
Two recording systems were used to track and record eye activities. The SmartEye eye-tracking system (http://www.smarteye.se) (Accessed date: 26 November 2021) was the primary device that tracked and captured the eye movements of the drivers, which is considered as ground truth in this paper. The second system was a digital camera that captured the driver's face and upper body. Each driver had the opportunity to agree or disagree that the video recording should be used at seminars or events when signing the informed consent.


Eye-Pupil Detection
Figure 2 shows a test participant and his detected eye-pupil position. A summary of the eye-pupil position detection and extraction through facial images is presented as a flow chart in Figure 3. Initially, video files are converted into images based on the frame size. In Step 2, the face is detected from the video images through a region of interest (ROI) using the Viola-Jones algorithm [50] and, to speed up face detection in the next consecutive image frames, face tracking is applied using the Kanade-Lucas-Tomasi (KLT) algorithm [51]. Details and a technical description of the face detection are presented in our previous article [52]. Several image processing steps are conducted to detect the eye-pupil positions in the image frames. First, the extracted facial ROI is converted into grayscale images, and the grayscale images are then transformed into binary images by imposing a threshold value of 0.5. Next, the binary image is converted into an inverse image. The inverse image helps to find the edges of the face formed by the presence of the eyes, nose and mouth. A Sobel edge detection method is used for detecting these edges in the inverse image. Then, the goal is to find the eyes; to do this, circular regions are detected. Finally, two circles for the eyes are detected, which provide the centers of the circles, i.e., the centers of the eye positions. For better understanding, Algorithm 1 with simplified pseudocode is provided below:
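The steps above can be summarized in simplified pseudocode (a reconstruction sketch of the described pipeline; the Viola-Jones/KLT details and the circle search criteria are abbreviated):

```
Algorithm 1: Eye-pupil position detection (simplified sketch)
Input:  video file V
Output: eye-pupil centers per frame
1:  frames <- convert V into images based on the frame size
2:  for each frame F in frames do
3:      if no face is currently tracked then
4:          ROI <- detect face in F (Viola-Jones)
5:      else
6:          ROI <- track face from the previous frame (KLT)
7:      G <- convert ROI to grayscale
8:      B <- binarize G with threshold 0.5
9:      I <- invert B
10:     E <- Sobel edge detection on I
11:     C <- detect circular regions in E      // candidate eyes
12:     select the two eye circles from C
13:     output the circle centers as the eye-pupil positions
```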

Feature Extraction
For the feature extraction, the raw eye movement signals are divided into fixations and saccades. The signal is a fixation when the eye gaze pauses in a certain position, and a saccade when it moves to another position. In brief, a saccade is a quick, simultaneous movement of both eyes in the same direction between two or more phases of fixation. Saccades and fixations are calculated from the raw time series of eye positions (X, Y). First, the velocity is calculated based on two adjacent positions and their respective times, and then all 13 features are calculated. The features, extracted from the eye positions, are listed in Table 1; they include the standard deviation and average of the fixation velocities and the maximum, standard deviation and average of the fixation durations.
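A minimal sketch of this velocity-based segmentation and of the fixation features listed in Table 1 (the sampling frequency and the velocity threshold separating fixations from saccades are illustrative assumptions, not the paper's values):

```python
import numpy as np

def eye_features(x, y, fs=60.0, vel_threshold=100.0):
    """Split a gaze trace (X, Y) into fixation/saccade samples by a velocity
    threshold and compute fixation-based features.
    fs: sampling frequency in Hz; vel_threshold: units/s (assumed values)."""
    dt = 1.0 / fs
    # velocity from two adjacent positions and their respective times
    vel = np.hypot(np.diff(x), np.diff(y)) / dt
    is_fix = vel < vel_threshold            # True -> fixation sample
    fix_vel = vel[is_fix]

    # durations of consecutive fixation runs
    durations, run = [], 0
    for f in is_fix:
        if f:
            run += 1
        elif run:
            durations.append(run * dt)
            run = 0
    if run:
        durations.append(run * dt)
    durations = np.array(durations)

    return {
        "sd_fix_velocity": float(np.std(fix_vel)),
        "mean_fix_velocity": float(np.mean(fix_vel)),
        "max_fix_duration": float(np.max(durations)),
        "sd_fix_duration": float(np.std(durations)),
        "mean_fix_duration": float(np.mean(durations)),
    }
```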

Classification Methods
In this paper, three approaches are deployed for cognitive load classification: an ML approach, a DL approach and a combined ML + DL approach. Figure 4 presents a block diagram for the machine learning approach, including the data processing, data set preparation, training, validation and classification steps. (1) Input signals: the eyeT signals recorded by the SmartEye system (i.e., the eye movement data recorded in (X, Y) format) and the facial video recorded by a Microsoft LifeCam Studio (https://www.microsoft.com/accessories/en-us/products/webcams/lifecam-studio/q2f-00013) (Accessed date: 26 November 2021) and stored on a separate computer. (2) Data processing: the approach focuses on the eye-pupil position extraction through facial images (presented in II (B)) and feature extraction (presented in II (C)). (3) Data set preparation: the extracted features are divided into two classes based on the auditory one-back secondary task performed by the participants during simulator driving. Based on the tasks performed, class '0' represents no cognitive load, i.e., the baseline (primary driving task only), and class '1' represents the one-back task (i.e., the secondary task).

ML Approach
A secondary task was imposed six times for each driver while driving in a scenario. The duration of each secondary task was 60 s. There were 12 scenarios and each driver drove 60 s in each scenario. The eye movement parameters are extracted considering a window size of 30 s. Therefore, there are 24 samples for each test subject, where 12 samples belong to class '0' and the remaining 12 samples belong to class '1'. A summary of the samples in each data set is shown in Table 2. In (4) Model classifiers and classification results, both training and validation tasks are considered. For the training, five ML algorithms, SVM, LR, LDA, k-NN and DT, are deployed based on the extracted features and considered as instances of supervised learning to classify the drivers' cognitive load [53]. For the validation, two cross-validation techniques are conducted: k-fold cross-validation and holdout cross-validation. In k-fold cross-validation, the data are partitioned into k randomly chosen subsets of roughly equal size, where k-1 subsets are used for training and the remaining subset is used for validating the trained model. This process is repeated k times such that each subset is used exactly once for validation. Holdout cross-validation partitions the data randomly into exactly two subsets of a specified ratio for training and validation. This method performs training and testing only once, which minimizes the execution time. Then, for the classification, the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) are first calculated for each ML algorithm. Finally, the classification accuracy and F1-score are obtained.
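The two validation schemes can be sketched with scikit-learn on synthetic stand-in data (the feature values, k = 5 and the holdout ratio are assumptions for illustration; the actual feature table is described above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature table: 33 subjects x 24 samples,
# 13 features, two balanced classes ('0' baseline, '1' one-back task).
X = rng.normal(size=(792, 13))
y = np.tile([0, 1], 396)
X[y == 1] += 1.0                      # make the two classes separable

# k-fold cross-validation: k-1 folds train, 1 fold validates, repeated k times
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# holdout cross-validation: a single random train/validation split
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_va, y_va)
```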

DL Approach
In the deep learning approach, two deep learning architectures are used which are CNN and LSTM networks.

CNN: Most existing CNN models consist of a large number of layers; for example, AlexNet has 25 layers, VGG16 has 41 layers, VGG19 has 47 layers and ResNet101 even has 347 layers. A greater number of layers means a more complex model that requires more processing time. In this study, a CNN architecture with 16 layers is designed from scratch to classify the cognitive load. The emphasis was on designing a CNN model with a smaller number of layers so that the processing time of the images can be reduced as much as possible. Among the 16 layers, there are one input layer, three convolutional layers, three batch normalization layers, three ReLU layers, three max-pooling layers, one fully connected layer, one softmax layer and the final output layer. The input layer reads the time series data and passes it into the series of other layers. The design of the CNN architecture and the dimensions of the hyperparameters are presented below in Figure 5.
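The layer sequence can be sketched in PyTorch (a hedged illustration: the filter counts, kernel sizes and input length are assumptions, not the hyperparameters of Figure 5; the input layer is implicit in the tensor shape):

```python
import torch
import torch.nn as nn

def build_cnn(n_channels=2, n_classes=2, seq_len=3600):
    """16-layer design counted as: input + 3x(conv, batchnorm, ReLU,
    max-pool) + fully connected + softmax + output."""
    return nn.Sequential(
        nn.Conv1d(n_channels, 16, kernel_size=5, padding=2),
        nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
        nn.Conv1d(16, 32, kernel_size=5, padding=2),
        nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
        nn.Conv1d(32, 64, kernel_size=5, padding=2),
        nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(2),
        nn.Flatten(),
        nn.Linear(64 * (seq_len // 8), n_classes),
        nn.Softmax(dim=1),
    )

model = build_cnn()
x = torch.randn(4, 2, 3600)   # batch of 4 two-channel (X, Y) gaze sequences
probs = model(x)              # class probabilities, one row per sequence
```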


LSTM: Another deep learning network, the LSTM, is used to classify the driver's cognitive load using the time series eye movement data. The LSTM network in this study is a type of recurrent neural network (RNN) and consists of five layers. The essential layers of an LSTM network are a sequence input layer and an LSTM layer. The time-series data are formed into sequences which are fed into the input layer of the network.
Figure 6 demonstrates the architecture of a simple LSTM network for the time series classification of cognitive load. The network starts with a sequence input layer followed by an LSTM layer. To predict class labels, the network ends with a fully connected layer, a softmax layer and a classification output layer. The LSTM layer architecture is illustrated in Figure 7. This diagram presents the flow of the time-series data (x, y) with C features (channels) of length S through an LSTM layer. In the diagram, h_t and c_t denote the output (or hidden state) and the cell state, respectively.
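A minimal PyTorch sketch of the five-layer design (sequence input, LSTM, fully connected, softmax, class output); the hidden size is an assumption, not the study's setting:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Sequence input -> LSTM -> fully connected -> softmax -> class output."""
    def __init__(self, n_channels=2, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, S, C)
        out, (h_t, c_t) = self.lstm(x)    # h_t: hidden state, c_t: cell state
        return torch.softmax(self.fc(h_t[-1]), dim=1)

model = LSTMClassifier()
probs = model(torch.randn(4, 100, 2))     # 4 sequences, S=100, C=2 channels
```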
ML + DL Approach: The combination of ML and DL is also considered to classify the driver's cognitive load, with two approaches: CNN + SVM and AE + SVM.
CNN + SVM: A DL + ML approach considering CNN + SVM is presented in Figure 8. The CNN model presented in Figure 8 is used to extract features automatically from the raw data, and the features are then used for cognitive load classification using a machine learning classifier.
In this case, the SVM classifier has been deployed. The automatic feature extraction is performed in the fully connected layer, which is the 14th layer of the network. The extracted features are then used to train the SVM model, where k-fold cross-validation is performed. Finally, the model classifier is deployed for the classification of cognitive load. AE + SVM: Another DL + ML approach, using AE + SVM, is applied for the automatic feature extraction from the raw data and the classification of the cognitive load. The AE in this case is a stacked AE, which is presented in Figure 9. The network of this stacked AE is formed by two encoders and one softmax layer; however, the second encoder is also called the decoder. The numbers of hidden units in the first and second encoders are 100 and 50, respectively. The raw data of size (360 × 3600) are fed into the input layer and then the first encoder is trained. Traditionally, the number of hidden units should be less than the input dimension. After training the first AE, the second AE is trained in a similar way. The main difference is that the features generated by the first AE become the training data of the second AE. Additionally, the size of the hidden layer is decreased to 50, so that the encoder in the second AE learns an even smaller representation of the input data. The original vectors in the training data had 3600 dimensions. After passing them through the first encoder, this was reduced to 100 dimensions. After the second encoder, this was reduced to 50 dimensions. Finally, these 50-dimensional vectors are used to train the SVM model to classify the two classes of cognitive load. A k-fold cross-validation approach is considered for the validation.
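The dimensionality flow of the stacked AE (3600 → 100 → 50) followed by the SVM can be illustrated as follows; for brevity the encoder weights are left untrained (random), whereas in the actual approach each encoder is trained to reconstruct its input:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(360, 3600))          # raw data of size 360 x 3600
y = np.tile([0, 1], 180)                  # two cognitive load classes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two encoders with the stated hidden sizes (weights untrained here)
W1 = rng.normal(scale=0.01, size=(3600, 100))
W2 = rng.normal(scale=0.1, size=(100, 50))

H1 = sigmoid(X @ W1)                      # 3600 -> 100 dimensions
H2 = sigmoid(H1 @ W2)                     # 100  -> 50 dimensions

# The 50-dimensional codes are used to train the SVM classifier
clf = SVC(kernel="linear").fit(H2, y)
```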

Evaluation Methods
After the implementation of the proposed approach as a proof-of-concept application, several experiments are conducted in which several evaluation methods are used. These experiments are mainly comparisons between the raw signals, features and classifications obtained by both the eyeT and camera systems. In addition, significance tests between the classes and the identification of the optimal window size are also considered. The evaluation methods are (1) the cumulative percentage, (2) the box plot and (3) sensitivity/specificity analysis.
The cumulative percentage is a way of expressing the frequency distribution of the raw data signals. It calculates the percentage of the cumulative frequency within each interval, much as a relative frequency distribution calculates the percentage of frequency. The main advantage of the cumulative percentage over the cumulative frequency as a measure of the frequency distribution is that it provides an easier way to compare different sets of data. Cumulative frequency and cumulative percentage graphs are the same, except for the vertical axis scale; it is possible to have the two vertical axes (one for cumulative frequency and another for cumulative percentage) on the same graph. The cumulative percentage is calculated by dividing the cumulative frequency by the total number of observations (n) and then multiplying by 100 (the last value will always be equal to 100%). Thus, the cumulative percentage is calculated by Equation (1).
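Equation (1), as described in words above, can be expressed directly in code:

```python
import numpy as np

def cumulative_percentage(frequencies):
    """Cumulative frequency divided by the total number of observations (n),
    multiplied by 100; the last value is always 100%."""
    cum_freq = np.cumsum(frequencies)
    return 100.0 * cum_freq / cum_freq[-1]

cp = cumulative_percentage([5, 10, 20, 15])   # n = 50
# -> [10., 30., 70., 100.]
```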

A box plot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis to visually show the distribution of numerical data and skewness by displaying the data quartiles (or percentiles) and averages. Here, it shows a five-number summary of a set of data: (1) minimum score, (2) first (lower) quartile, (3) median, (4) third (upper) quartile and (5) maximum score. The minimum score is the lowest score, excluding outliers. The lower quartile shows that 25% of the scores fall below the lower quartile value (also known as the first quartile). The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile); half the scores are greater than or equal to this value and half are less. The upper quartile shows that 75% of the scores fall below the upper quartile value (also known as the third quartile); thus, 25% of the data are above this value. The maximum score is the highest score, excluding outliers (shown at the end of the right whisker). The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% and the upper 25% of scores). The interquartile range (IQR) is the box of the box plot showing the middle 50% of scores (i.e., the range between the 25th and 75th percentiles).
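The five-number summary and IQR can be computed directly; a small sketch with hypothetical scores, using the common 1.5 × IQR convention for the whisker fences (an assumption — the paper does not state its outlier rule):

```python
import numpy as np

scores = np.array([2, 4, 4, 5, 6, 7, 8, 9, 12])     # hypothetical feature values
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1                                        # middle 50% of the scores
# Common convention: points beyond 1.5 * IQR from the box are outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, median, q3, iqr)                           # 4.0 6.0 8.0 4.0
```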
In the sensitivity/specificity analysis, the predicted response is compared with the actual response, and the accuracy of each classifier-based model is computed in terms of the evaluation metrics sensitivity (or recall), specificity, precision, F1-score, accuracy and ROC AUC [53]. All these metrics are calculated based on the formulas given in Equations (2)-(6).

Sensitivity or Recall
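Equations (2)-(6) are not reproduced above, but these metrics follow the standard confusion-matrix definitions; a sketch with hypothetical counts consistent with the study's 720 observations (360 per class):

```python
def metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics (Equations (2)-(6))."""
    sens = tp / (tp + fn)                    # sensitivity / recall
    spec = tn / (tn + fp)                    # specificity
    prec = tp / (tp + fp)                    # precision
    f1 = 2 * prec * sens / (prec + sens)     # harmonic mean of precision/recall
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    return sens, spec, prec, f1, acc

# Hypothetical counts: 720 observations, 360 per class, 60 misclassified.
print(metrics(tp=330, fp=30, tn=330, fn=30))
```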
Another important measurement is the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The ROC curve shows the performance of a classification model at all classification thresholds. This curve plots two parameters: the true positive rate (TPR) and the false-positive rate (FPR). Lowering the classification threshold classifies more items as positive, thus increasing both FP and TP. The AUC measures the entire two-dimensional area underneath the ROC curve, and it ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
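A minimal sketch of the ROC/AUC computation using scikit-learn (hypothetical labels and scores, not the study's data):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]                    # hypothetical labels
y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4]   # hypothetical model scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)    # one (FPR, TPR) per threshold
auc = roc_auc_score(y_true, y_score)
print(auc)   # 0.9375: one positive (0.4) is outscored by one negative (0.6)
```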
Two statistical significance tests, Wilcoxon's signed-rank test and DeLong's test, are conducted to compare the performance of the models based on their ROC curves. The Wilcoxon signed-rank test is a nonparametric test used to compare two sets of scores that come from the same participants, from which a z-score and p-value are obtained. DeLong's test is performed between two models based on their ROC curves, and the p-value and z-score of the two curves are obtained; p < 0.05 can be seen as a significant difference between the two curves. If the z-score deviates far from zero, it is concluded that one model has a statistically different AUC from the other model with p < 0.05.
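The Wilcoxon signed-rank comparison of two models can be sketched with SciPy (hypothetical per-fold accuracies on the same folds; DeLong's test is not available in SciPy and is omitted here):

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies of two classifiers on the same 5 folds.
model_a = [0.92, 0.90, 0.93, 0.91, 0.94]
model_b = [0.90, 0.85, 0.89, 0.88, 0.86]
stat, p = wilcoxon(model_a, model_b)   # paired, nonparametric signed-rank test
print(stat, p)
```

Because model_a outperforms model_b on every fold, the signed-rank statistic is 0; with only five pairs, the exact two-sided p-value (0.0625) still does not reach the 0.05 level.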

Experimental Works and Results
The aim of these experiments is to observe the performance of the camera system compared to the commercial eye-tracking (eyeT) system in terms of raw signal comparisons, extracted feature comparisons and drivers' cognitive load classification. The experimental works in this study are four-fold: (1) comparison between the raw signals extracted by the camera system and by the commercial eyeT system, (2) selection of the optimal sampling frequency, i.e., identification of the sampling frequency that is best for feature extraction and classification, (3) comparisons between the extracted features based on the camera system and the eyeT system and, finally, (4) cognitive load classification and comparisons between the camera system and the eyeT system.

Comparison between Raw Signals
This experiment aims to determine how closely the raw signals extracted by the camera system compare to those extracted from the eyeT system. Here, a visualization of the raw signals and the cumulative percentage of the raw signals have been calculated. For the visualization of the raw signals by both the camera system and the commercial eyeT system, a test subject is randomly selected and the saccade and fixation signals are plotted for 200 samples, as presented in Figure 10.
The cumulative percentage experiment aims to examine the frequency distribution of the raw data extracted by the proposed camera system compared to the eyeT system. To calculate the cumulative percentage, the following steps are performed: Step (1): the percentages of absolute differences between the eyeT and camera raw signals are calculated for each subject, considering the x and y signals.
Step (2): Then, the cumulative percentages are calculated for x and y and their average values are considered as a cumulative percentage for a subject.
Step (3): Finally, the average cumulative percentage for 30 test subjects is calculated. An example of cumP calculation is presented in Table 3 and the average cumP for 30 test subjects is shown in Figure 11.
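Steps (1)-(3) above can be sketched as follows, using synthetic signals in place of the recorded eyeT/camera data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def subject_cum_pct(eyet_xy, cam_xy, max_diff=40):
    """Steps (1)-(2): cumulative percentage of absolute differences,
    averaged over the x and y signals of one subject."""
    curves = []
    for axis in (0, 1):                                   # x and y signals
        diff = np.abs(eyet_xy[:, axis] - cam_xy[:, axis])
        # Count differences falling below each integer threshold 0..max_diff.
        counts, _ = np.histogram(diff, bins=np.arange(max_diff + 2))
        curves.append(100 * np.cumsum(counts) / diff.size)
    return np.mean(curves, axis=0)

# Step (3): average the per-subject curves over all 30 subjects.
subjects = [(rng.normal(size=(200, 2)) * 10,              # synthetic eyeT signal
             rng.normal(size=(200, 2)) * 10)              # synthetic camera signal
            for _ in range(30)]
avg_cum = np.mean([subject_cum_pct(e, c) for e, c in subjects], axis=0)
print(avg_cum.shape)
```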
Figure 10. Visualization of the raw signals extracted by the camera system and compared with the eyeT system.
It is observed that the saccade peaks between eyeT and camera signals are identical, and only the amplitude of the fixation of the camera signal is higher than the amplitude of eyeT. This makes it easy for the feature extraction task by the proposed camera system.

Figure 11. Average cumulative percentage of all 30 test subjects considering the raw signals extracted by the proposed camera system compared to the eyeT system.
As can be observed from Figure 11, cumulative percentages of 80, 90 and 100 are achieved when the absolute difference thresholds are 13, 20 and 40, respectively. This means that, to achieve 100% agreement between the raw signals extracted from the camera system and the eyeT system, the absolute difference between the two raw signals should be within 40, as an average over the 30 subjects.

Selection of Optimal Sampling Frequency
Once the raw signals are extracted and compared, the next task is to extract features for the cognitive load classification. To achieve good features and better classification accuracy, the best sampling size must be selected; thus, this experiment aims to identify the best window size for feature extraction and cognitive load classification. In this study, each of the secondary n-back tasks was imposed on the driver for one minute during simulator driving. Three different time windows, i.e., 60 s, 30 s and 15 s, were considered for feature extraction to observe which window size performs best for the ML algorithms, i.e., SVM, LR, LDA, k-NN and DT, considering the F1-score. Table 4 presents the performance, i.e., F1-score and accuracy, of all five ML algorithms for both eyeT and camera data. It can be observed that the F1-score and accuracy are better for the 30 s window than for the 60 s and 15 s windows. It is also observed that k-fold cross-validation (i.e., k = 5) achieves higher accuracy than holdout cross-validation. From the comparison, it is observed that the highest F1-score and accuracy are achieved when the window size is 30 s. Therefore, all the experiments in subsequent sections only consider the data sets with a 30 s window and k-fold cross-validation.
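The window-size comparison can be sketched as follows; this is an illustration with synthetic feature sets (one per window length, shorter windows yielding more samples) rather than the study's data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical: one feature matrix per window size (60 s, 30 s, 15 s);
# shorter windows give more samples from the same recordings.
windows = {60: make_classification(n_samples=60, n_features=13, random_state=0),
           30: make_classification(n_samples=120, n_features=13, random_state=0),
           15: make_classification(n_samples=240, n_features=13, random_state=0)}

for w, (X, y) in windows.items():
    # 5-fold cross-validated F1-score for one candidate classifier.
    f1 = cross_val_score(SVC(kernel="linear"), X, y, cv=5, scoring="f1").mean()
    print(w, round(f1, 3))
```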

Comparisons between the Extracted Features
This experiment focuses on the comparisons between the features extracted by the camera system and the eyeT system. Here, the correlation coefficient is measured between the feature sets to observe the closeness of the features, and then the features are compared between the systems considering the 0-back and 1-back cognitive loads. The correlation coefficient r between the features of the eyeT and camera systems is presented in Figure 12. In each case, the p-values are 0 (i.e., <0.05), which means the correlations are significant. The highest value of r is 0.95 and the lowest value is 0.82, which indicates that there is a good positive relation between the features of the two systems.
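The per-feature correlation test can be sketched with SciPy's `pearsonr`, which returns both r and the p-value (the feature values below are hypothetical):

```python
from scipy.stats import pearsonr

# Hypothetical values of one feature measured by both systems.
eyet_feat = [210, 180, 250, 300, 270, 220]
cam_feat = [205, 190, 240, 310, 260, 230]
r, p = pearsonr(eyet_feat, cam_feat)   # correlation and its significance
print(round(r, 2), p < 0.05)
```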
Statistical comparisons between the features extracted from the 0-back and 1-back classes are conducted to observe whether there are any significant differences between the cognitive load and non-cognitive load tasks, both for the eyeT and camera systems. Here, four statistical parameters, maximum (MAX), minimum (MIN), average (AVG) and standard deviation (STD), are calculated for all test subjects. Figure 13 presents the average summary of the statistical measurements for the eyeT system, and Figure 14 presents the average summary of the statistical measurements for the camera system.
Figure 12. Correlation coefficients between the features extracted both by the eyeT and camera systems.
Figure 13. Summary of the statistical parameters comparing n-back tasks 0 and 1 considering the features of the eyeT system.
Figure 14. Summary of the statistical parameters comparing n-back tasks 0 and 1 considering the features of the camera system.
It can be observed in both cases that there are significant differences between 0-back and 1-back, considering all 13 extracted features.
Box plots are presented to see the significant differences of the features between 0-back and 1-back. Here, the summary of the comparisons includes the (1) first (lower) quartile, (2) median and (3) third (upper) quartile scores. Figure 15 presents box plots for the features extracted by the eyeT system and Figure 16 presents box plots for the features extracted by the camera system. According to both figures, the 0-back and 1-back features show significant differences considering the 1st quartile, 3rd quartile and median values.

Classification Results
This experiment focuses on the robustness of the ML and DL algorithms in terms of cognitive load classification. Here, the average classification accuracy is observed for five ML algorithms, SVM, LR, LDA, k-NN and DT, and three DL algorithms, CNN, LSTM and AE, both for eyeT and camera features, considering a 30 Hz sampling frequency. K-fold cross-validation was performed for each classifier, where K is 5. Sensitivity and specificity are also calculated for each algorithm. Table 5 presents the classification accuracy, sensitivity and specificity for SVM, LR, LDA, k-NN and DT, both for eyeT and camera data. Table 5. Sensitivity, specificity, precision, F1-score and accuracy for the SVM, LR, LDA, k-NN and DT classifiers for both eyeT and camera data, where the total number of observations is 720, with 360 0-back and 360 1-back classes. Different hyperparameters were explored to achieve the highest classification accuracy for all ML models. In the SVM model, three kernel functions, i.e., the 'linear', 'gaussian' (or 'rbf') and 'polynomial' kernel functions, were deployed, where the 'linear' kernel function produced the highest classification accuracy. In the LR model, the 'logit' function was used for the classification task. In the LDA model, five types of discriminant functions were used, 'linear', 'pseudolinear', 'diaglinear', 'quadratic' and 'pseudoquadratic' (or 'diagquadratic'), where the best accuracy was achieved using the 'linear' discriminant function. In the k-NN model, different values of k were explored, where the best was k = 10 with the 'Euclidean' distance function. In the DT model, three criterion functions were explored for choosing a split: 'gdi' (Gini's diversity index), 'twoing' for the twoing rule and 'deviance' for maximum deviance reduction (also known as cross-entropy). The best was 'gdi', where the maximum number of splits is 4.
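The kernel exploration for the SVM can be sketched with scikit-learn's `GridSearchCV` on synthetic data of the same shape as the study's feature matrix (720 × 13); the kernel selected here depends on the synthetic data, not the study's result:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the 720 x 13 feature matrix (360 samples per class).
X, y = make_classification(n_samples=720, n_features=13, random_state=0)

# Try each candidate kernel with 5-fold cross-validation.
grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf", "poly"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```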
According to Table 5, the highest overall accuracy achieved for the camera data is 92% using the SVM classifier, and the highest classification accuracy for the eyeT data is also 92%, achieved by the SVM, LR and LDA classifiers (shaded in gray).
For the visualization of the tradeoff between the true positive rate (TPR) and false-positive rate (FPR), ROC curves are plotted for all ML algorithms and AUC values are calculated, which are presented in Figure 17 for the eyeT system and Figure 18 for the camera system. In the ROC curve, TPR and FPR are calculated for every threshold and plotted on one chart. A higher TPR and a lower FPR at each threshold indicate better performance, so classifiers whose curves lie closer to the top-left corner are better. To obtain one number that summarizes the ROC curve, the area under the ROC curve (ROC AUC) score is calculated; the closer the curve is to the top left, the larger the area and hence the higher the ROC AUC score. Figure 17 shows that the AUC values for the eyeT system are 0.97 for SVM, LR and LDA, and 0.95 and 0.92 for k-NN and DT, respectively, indicating that the ROC curves for SVM, LR and LDA show better performance than those for k-NN and DT. Figure 18 shows that the AUC values for the camera system are 0.92 for SVM, LR and LDA, and 0.91 and 0.85 for k-NN and DT, respectively, which again indicates that the ROC curves for SVM, LR and LDA show better performance than those for k-NN and DT.

Statistical Significance Test
Two statistical significance tests (i.e., the Wilcoxon signed-rank test and DeLong's test) are conducted between the camera and eyeT data for each model. Initially, the values of P, H and stats are calculated for each model, considering the actual and predicted classes of the model, using the Wilcoxon signed-rank test. Here, P is the probability of observing the given result, H is the outcome of the hypothesis test performed at the 0.05 significance level (H = 0 indicates that the null hypothesis ("median is zero") cannot be rejected at the 5% level; H = 1 indicates that the null hypothesis can be rejected at the 5% level), and stats is a structure containing one or two fields (the field 'signedrank' contains the value of the signed-rank statistic for the positive values in X, X-M or X-Y; if P is calculated using a normal approximation, the field 'zval' contains the value of the normal (Z) statistic).
For conducting this test, initially, the null hypothesis is set to H 0 that there is no difference in performance measures of a classifier with significance level 0.05. The model/models which is/are significantly different than others are shaded in gray color.
The summaries of the two statistical significance tests for Wilcoxon's signed ranked and Delong's test are presented in Tables 6 and 7, respectively.

Discussion
The main goal of this study was to investigate the classification accuracy of drivers' cognitive loads based on saccade and fixation parameters. These parameters are extracted from the eye positions in the facial images recorded by a single digital camera. The classification performance of the camera system was also compared with the eyeT system to investigate the closeness of the classification results. Based on the literature study, 13 eye movement features are extracted from both the camera and the eyeT data, which are presented in Table 1. These features have shown good performance in cognitive load classification in [9,47], which is also true for this study, where the highest classification accuracy for both the eyeT and camera systems is 92%, which is at least 2% higher than the state-of-the-art accuracy. In [50], the highest cognitive load classification accuracy was achieved at 86% using an ANN algorithm considering a different workload situation. The authors of [51] reviewed the current state of the art and found that the average classification accuracy of cognitive load is close to 90% when considering eye movement parameters. In [52], the highest cognitive load classification accuracy achieved was 87%.
As the raw eye movement signals were extracted from facial images captured by a camera system, the raw signals were plotted for both the eyeT and camera systems to observe the closeness of the signals as well as their characteristics. As shown in Figure 10, the signals from the eyeT and camera systems look similar in terms of saccade peaks, with small differences in the fixation amplitudes. The actual reason for this amplitude difference of the fixation is unknown; however, it might have occurred due to the change in the sampling frequency of the eyeT from 50 Hz to 30 Hz. A cumulative percentage between the raw signals of the eyeT and camera systems is also calculated and presented as another experiment to see the similarity of the two raw signals. Here, the similarity is assessed using a threshold value of the absolute difference, as shown in Figure 11: cumulative percentages of 80 and 90 are achieved when the absolute differences are 13 and 20, respectively.
Before conducting the feature extraction, comparisons and the classification task, an experiment is conducted based on the F1-score and accuracy to find the optimal sampling size for feature extraction. As such, three feature sets are generated from the raw signals of both the eyeT and camera systems. Here, the considered sampling sizes are 60 s, 30 s and 15 s. The summary of the results using the ML algorithms is presented in Table 4.
Before conducting classification, three types of statistical analyses were conducted on the saccade and fixation features between the eyeT and camera systems using several statistical parameters, such as correlation coefficients, MAX, MIN, AVG, STD and boxplots. The 1st experiment compares the features of the eyeT and camera systems using correlation coefficients. Here, the closeness of the extracted features between the two systems is observed and presented in Figure 12. The results show that the correlation coefficients for all saccade and fixation features between eyeT and camera range from 0.82 to 0.95, which indicates a strong positive relation between eyeT and camera. The 2nd experiment observes whether there are any significant differences between the features of the 0-back and 1-back cognitive load classes. Here, the statistical parameters MAX, MIN, AVG and STD show that there are significant differences between the 0-back and 1-back features both for eyeT and camera, as presented in Figures 13 and 14, respectively. Boxplots for all 13 features between 0-back and 1-back also confirm that there are significant differences considering the 1st quartile, 3rd quartile and median values both for eyeT and camera. These boxplots are presented in Figure 15 for eyeT and Figure 16 for the camera.
Five machine learning algorithms, SVM, LR, LDA, k-NN and DT, have been investigated to classify cognitive load. The summary of the classification results, including sensitivity, specificity, precision, F1-score and accuracy, is presented in Table 5. The highest accuracy for both the eyeT and camera systems is achieved at 92% by SVM. In this paper, different kernel functions, such as the linear, Gaussian and polynomial kernel functions, are investigated, and the results show that the linear kernel function performs better than the other kernels. With polynomial kernels, the training accuracy increases, but the model then tends to overfit due to the large spread of the data sets. The LR and LDA classifiers also show similar performance in binary classification.
To take advantage of automatic feature extraction, three DL algorithms, CNN, LSTM and AE, are deployed. The results suggest that both the DL and DL + ML approaches perform similarly well; however, the highest accuracy of 91% is obtained by CNN, which is 1% higher than LSTM and AE. In CNN, there are a few pooling layers for which the features are organized spatially like an image, and thus downscaling the features makes sense, while LSTM and AE do not have this advantage. Considering the processing time per image, DL-based technology can be a potential contributor to advanced driver assistive systems; in the experiment, it was noted that the processing time for 1000 images is less than 0.1 s using an NVIDIA GPU.
For the visualization of the tradeoff between TPR and FPR, ROC curves are plotted and AUC values are calculated. Figure 17 shows that, for the eyeT system, the AUC values of SVM, LR and LDA indicate better performance than those of k-NN and DT; Figure 18 shows the same for the camera system. To compare the performance of the models with each other, two statistical significance tests, the Wilcoxon signed-rank test and DeLong's test, are deployed. For both tests, the p-values and z-scores show similar characteristics. The results based on the p-values and z-scores suggest that the DT models are significantly different from the other models (p < 0.05). Technically, this experiment has several limitations.

Conclusions
A non-contact-based driver cognitive load classification scheme based on eye movement features is presented, considering a driving simulator environment, which is a new technique for advanced driver assistive systems. The average highest accuracy for the camera features is achieved at 92% using the SVM classifier. In this paper, saccade and fixation features are extracted from the driver's facial image sequence. In addition, three DL models are used for automatic feature extraction from the raw signals, for which the highest classification accuracy is 91%. It is observed that manual feature extraction provides 1% better accuracy than automatic feature extraction. Non-contact-based driver cognitive load classification can be optimized by minimizing the extraction error of the eye movement parameters, i.e., the saccade and fixation features, from the facial image sequences. Accurate eye detection and tracking is still a challenging task, as there are many issues associated with such systems. These issues include the degree of eye openness, variability in eye size, head pose, facial occlusion, etc. Different applications that use eye tracking are affected by these issues at different levels. Therefore, these factors need to be considered for further improvement. Additionally, the experiment should be conducted considering real road driving for optimum reliability.
Author Contributions: H.R. contributes in conceptualization, methodology, software, validation, resources, data curation and writing-original draft preparation; M.U.A. contributes in conceptualization, methodology, supervision, project administration and funding acquisition; S.B. (Shaibal Barua) contributes in conceptualization, methodology and writing-review and editing; P.F. contributes in conceptualization, methodology, supervision, project administration and funding acquisition; S.B. (Shahina Begum) contributes in conceptualization, methodology, review, supervision, project administration and funding acquisition. All authors have read and agreed to the published version of the manuscript.
Funding: This article is part of the 'SafeDriver' project, funded by the Swedish Knowledge Foundation (KKS).
Institutional Review Board Statement: The regional ethics committee at Linköping University, Sweden (Dnr 2014/309-31) approved the study.

Informed Consent Statement:
Each participant signed an informed consent form.

Data Availability Statement:
The data is not publicly available due to GDPR issue.