ECG-Based Driving Fatigue Detection Using Heart Rate Variability Analysis with Mutual Information

Abstract: One of the WHO's strategies to reduce road traffic injuries and fatalities is to enhance vehicle safety, and driving fatigue detection can be used to increase vehicle safety. Our previous study developed an ECG-based driving fatigue detection framework with AdaBoost, producing a high cross-validated accuracy of 98.82% and a testing accuracy of 81.82%; however, that study did not consider the driver's cognitive state related to fatigue or the redundant features in the classification model. In this paper, we propose developments in the feature extraction and feature selection phases of the driving fatigue detection framework. For feature extraction, we employ heart rate fragmentation to extract non-linear features for analyzing the driver's cognitive status. These features are combined with features obtained from heart rate variability analysis in the time, frequency, and non-linear domains. For feature selection, we employ mutual information to filter out redundant features. To find the number of selected features that yields the best model performance, we carried out 28 experiments combining 7 candidate feature-subset sizes (out of 58 features) with 4 ensemble learning algorithms. The results show that the random forest algorithm with 44 selected features produced the best performance, with a testing accuracy of 95.45% and a cross-validated accuracy of 98.65%.


Introduction
One of the goals of SDG 3 focuses on reducing road traffic injuries and fatalities: to halve the number of deaths and injuries caused by road traffic accidents worldwide by 2030. To achieve SDG target 3.6, the WHO suggests a comprehensive strategy that includes enhancing road safety management, strengthening road regulation and enforcement, promoting safer road infrastructure and mobility, increasing vehicle safety standards, enhancing emergency trauma care, and increasing public awareness [1]. In this paper, we focus on vehicle safety by developing a framework for detecting driver fatigue. Driving fatigue detection aims to detect fatigue or drowsiness while driving by monitoring a driver's physiological state with various types of fatigue measurement methods [2]; an electrocardiogram (ECG) or an electroencephalogram (EEG), for example, are objective measurements.
In our previous study [3], we developed a driving fatigue detection framework in which developments were carried out at three phases. In the data preprocessing phase, we applied the resampling method with an overlapping window to increase the diversity of the dataset. In the data processing phase, we applied heart rate variability (HRV) analysis methods with two non-linear approaches, a Poincare plot analysis and a multi-fractal detrended fluctuation analysis. Thus, five HRV feature extraction methods were used: statistical analysis, geometrical analysis, spectral analysis, Poincare plot analysis, and multi-fractal detrended fluctuation analysis. These methods were utilized to extract features from the NN intervals in the time domain, the frequency domain, and the non-linear approach, for a total of 54 extracted features. In the classification phase, the ensemble method was applied to classify two fatigue states: alert and fatigue. With all of these developments, the random forest classification model produced a cross-validated accuracy of 97.98% and a testing accuracy of 86.36%, and the AdaBoost classification model produced a cross-validated accuracy of 98.82% and a testing accuracy of 81.82%. These results are significant; however, our previous study uncovered the following problems.

1. The study did not consider the cognitive fatigue status experienced by drivers when they were fatigued. Several studies [4-6] stated that there is an association between driver cognitive fatigue and the causal factors of driving fatigue. Therefore, it is necessary to extract information related to cognitive status from the NN interval data.
2. The study did not consider the redundant features that can affect the performance of the classification model. It was demonstrated that the Poincare plot analysis and multifractal detrended fluctuation analysis methods improved performance only in the random forest and AdaBoost models, not in the bagging and gradient boosting models. This is due to the 54 extracted features, which make the model too complex and reduce its interpretability. Therefore, a feature selection method is needed to reduce the number of redundant features.
To address these issues and improve the performance of the classification model, we propose two developments, in feature extraction and feature selection, within the proposed driving fatigue detection framework (Figure 1). Our main contributions are as follows:
1. In the feature extraction phase, we applied heart rate fragmentation. It was first proposed by Costa et al. [7] and has never been used or further investigated in previous driving fatigue studies (Table 1). Costa et al. [8] reported that heart rate fragmentation can be useful for monitoring cognitive status. We hypothesized that heart rate fragmentation can be used to monitor driver cognitive status, which represents the fatigue state of the driver. Heart rate fragmentation is used to extract non-linear features from NN interval data.
2. Our previous study [3] applied no feature selection in the driving fatigue detection framework; therefore, we added a feature selection phase to the framework proposed here (Figure 1). In this phase, we chose mutual information over other feature selection methods because we applied both linear and non-linear feature extraction methods, and mutual information can capture both linear and non-linear relationships between variables. Additionally, mutual information can be used to measure the relevance of features to the target variable [9].
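As an illustration of the mutual information filter described above, the following sketch ranks features by mutual information with the class label and keeps the top k, using scikit-learn. The synthetic 58-feature data and the choice k = 44 stand in for the real HRV feature matrix and are illustrative only, not the paper's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Synthetic stand-in for the 58-feature HRV dataset (values are hypothetical).
X, y = make_classification(n_samples=200, n_features=58, n_informative=10,
                           random_state=0)

# Score each feature's relevance to the alert/fatigue label.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Keep the k features with the highest mutual information (k = 44 in the paper).
selector = SelectKBest(mutual_info_classif, k=44)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 44)
```

Sweeping k over several candidate sizes and re-fitting the classifier at each size is one way to reproduce the kind of feature-count experiment described in this paper.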
This study is organized as follows: the second section, "Related Works", discusses previous driving fatigue detection studies that used ECG fatigue measurement. The third section, "Materials and Methods", describes the dataset used in this study, data preprocessing, feature extraction, feature selection, and the classification model. The fourth section, "Results and Discussion", presents the various experiments applied to the proposed driving fatigue detection method, the experimental results, and the discussion. The last section, "Conclusions", concludes the study.

Related Works
May and Baldwin [10] presented a study showing that it is important to first identify the causal factors of fatigue before deciding which measurement methods to use to detect driving fatigue. They presented two types of causal factors for driving fatigue: sleep-related fatigue and task-related fatigue. Sleep-related fatigue is caused by the circadian rhythm effect, prolonged wakefulness, and sleep deprivation. Task-related fatigue results from the length and task demands of driving and is unrelated to any sleep-related causes. The type of fatigue addressed in this paper is sleep-related fatigue, specifically fatigue caused by sleep deprivation.
Bier et al. [11] presented two fatigue measurement methods: performance-based fatigue measurement and condition-based fatigue measurement. In surveys of driving fatigue studies [12,13], most researchers applied fatigue measurements based on condition data. For example, Khunpisuth et al. [14] used fatigue measurement based on aspects of the physical condition (such as facial features, eye features, or a combination of physical features), which resulted in a very high accuracy of up to 99.59%. Another study [15] applied fatigue measurement based on physiological conditions using EEG, which resulted in a high accuracy of 97.19%. Due to its high accuracy in fatigue detection, fatigue measurement based on condition data is more popular than fatigue measurement based on performance data among driving fatigue detection studies.
There are several important points to consider when choosing a fatigue measurement method: accuracy, non-intrusiveness, suitability for any driving condition, adaptability to any driver condition, and practicality [12,16,17]. The most popular fatigue measurement method is measuring fatigue based on the driver's physical state [16,18]. This method is practical, non-intrusive, and generally produces very high accuracy, but it is not suitable for all driving conditions with regard to daytime or nighttime driving, and it cannot adapt to driver conditions related to skin color, eye color, or clothing. Another popular fatigue measurement method is measuring fatigue through physiological signals, for example using EEG [15]. This method adapts to any driver condition, is suitable for any driving condition, and generally has high accuracy because it measures brain activity directly; however, it is not practical since it needs at least six electrodes placed on the scalp to measure brain signals [19].
In this paper, we chose the ECG as the fatigue measurement method because it can adapt to the driver's condition, is suitable in any driving condition, and has high accuracy if it is combined with the right processing and classification methods. The use of an ECG while driving can be intrusive if the recording uses the standard 3-, 6-, or 12-lead placements; however, the ECG can be non-intrusive if the driver uses a single-lead ECG device, a textile-based ECG, or a wearable ECG device such as a smart watch. Our previous study [3] proposed a framework for driving fatigue detection using an ECG with two electrodes. With the right preprocessing of ECG data using the proposed resampling method, the AdaBoost model achieved an accuracy of 98.82% for validation and 81.82% for testing (Table 1); thus, its performance could be increased further by feature engineering or by extracting more features from RR interval data. As a comparison, considering the development of a driving fatigue detection framework, we reviewed driving fatigue detection methods using the ECG in the literature from 2018 to 2023, as shown in Table 1. Recent studies over the past five years (Table 1) have predominantly utilized a binary classification approach, distinguishing between "alert" and "fatigue" states, and researchers are still actively exploring various combinations of preprocessing, feature extraction, feature selection, and classification methods to improve the accuracy of driver fatigue detection.
In Table 1, most driving fatigue detection studies applied feature selection with a filter approach to select relevant features and improve the model's performance. For example, Kim and Shin [20] extracted 104 features using HRV analysis, the weighted standard deviation, the weighted mean, and dominant respiration, and applied greedy feed-forward selection to select 12 of the 104 features, filtering out 88.46% of the features. Their model achieved an area under the curve of 95% for validation, with no testing result reported. The majority of the studies in Table 1 that applied feature selection did not specify the method used to determine the optimal number of features and did not explore the effect of the selected features on the performance of the model. On the other hand, Babeian et al. [21] extracted 52 features using the wavelet transform and proposed a feature selection algorithm along with ensemble learning. Their algorithm selected 24 features, or 46.15% of the total, as the subset with the highest balanced classification rate. Their model achieved an accuracy of 92.5% for training, with no testing accuracy reported. Babeian et al.'s approach inspired us to explore the effect of the number of selected features on the model's performance.
Another factor that should be considered in developing a driving fatigue detection framework is the feature extraction method. Feature extraction plays an important role in detecting driving fatigue, as it can reduce dimensionality, improve the interpretability of data, and improve the model's performance [22]. The majority of studies in Table 1 used HRV analysis to extract data from the time domain, the frequency domain, and the non-linear approach. HRV analysis is still an effective feature extraction method for increasing the model's performance. For example, Kundinger et al. [23] extracted 26 features from RR interval data using HRV analysis and achieved a cross-validated accuracy of 97.37% with the random forest model, and our previous study [3] extracted 54 features from RR interval data using HRV analysis and achieved a cross-validated accuracy of 98.82% with the AdaBoost model. Furthermore, our previous study [3] also showed that HRV analysis combining the time, frequency, and non-linear approaches yields better model performance than HRV analysis with the time and frequency domains alone. In short, this paper proposes an improvement of the driving fatigue framework from our previous study [3]: we develop the feature extraction phase using the HRV analysis method and the feature selection phase using the mutual information method.

ECG Dataset
In this study, an electrocardiogram was utilized as the objective method for measuring driver fatigue. The dataset utilized for testing the proposed driving fatigue detection framework is taken from [29], the same dataset utilized in our previous study [3]. The dataset was created based on a sleep-related fatigue induction approach. Eleven healthy subjects (ten men and one woman) between the ages of 24 and 28 participated in the driving simulation. Each of them held a driver's license and simulated driving for at least 30 min under the two specified driving states: alert and fatigued. Each driving state was tested on two separate days. In order to achieve the state of alert driving, drivers were given directions to sleep for at least seven hours prior to the driving simulation. In order to induce driver fatigue, the drivers were given directions to go to bed late and sleep for less than seven hours prior to the driving simulation.
Before the experiment, the subjects completed a subjective questionnaire to assess their levels of fatigue, and the results demonstrate that the sleep-deprived subjects were significantly more fatigued than the well-rested subjects. The average sleepiness scores for all subjects were 1.4 for the well-rested state and 4.1 for the sleep-deprived state (1: rarely sleepy to 5: very sleepy). All the subjects were compensated approximately $10 per hour for their participation in the simulated driving. The Institutional Review Board of the Gwangju Institute of Science and Technology approved the experiment [29].
There are three types of recorded physiological signals contained in the dataset [29]: EEG, EOG, and ECG. However, we only used the ECG data, recorded using a BioSemi ActiveTwo with a sampling rate of 512 Hz, in this study. The ECG data were labeled sleep-good (SG), representing the alert state, and sleep-bad (SB), representing the fatigued state. The ECG signals were acquired using a modified lead-I with two electrodes placed in the second intercostal position, as shown in Figure 2.
The dataset used in this study exhibits variations among the subjects, primarily encompassing age, gender, and ECG recording time; however, specific information regarding the age and gender of the individual subjects is unavailable. Figure 3 shows the ECG recording time (measured in minutes) for each subject in the two fatigue states: alert and fatigued.

QRS Detection
The next phase was QRS wave detection from the ECG signals (Figure 1). Afterwards, the peak of the R wave was detected for each QRS wave. The accuracy of QRS wave detection is important because it relates to the detection of the R wave and can indirectly affect the analysis of heart rate variability [30]. In this study, we chose the same QRS detection algorithm as that used in our previous study, the Pan-Tompkins algorithm [3]. The minimum sampling frequency needed for the Pan-Tompkins approach is 200 Hz [31], and the dataset utilized for the study fulfills this requirement. After detecting the R wave of a QRS wave, the next step was to measure the RR interval, which is the distance between two consecutive R waves in two adjacent QRS complex waves. This study uses the term NN interval instead of RR interval.
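To illustrate how R-peak detection yields the NN intervals used downstream, here is a minimal sketch. It uses scipy.signal.find_peaks on a toy spike train as a stand-in for the Pan-Tompkins algorithm, so the signal and thresholds are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 512  # Hz, matching the dataset's sampling rate
t = np.arange(0, 10, 1 / fs)
# Toy ECG-like signal: one sharp "R wave" per second (60 bpm).
ecg = np.zeros_like(t)
ecg[(np.arange(len(t)) % fs) == 0] = 1.0

# Detect R peaks; enforce a 200 ms refractory distance (Pan-Tompkins applies
# a similar physiological constraint between beats).
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.2 * fs))

# NN intervals in seconds: differences between consecutive R-peak times.
nn_intervals = np.diff(peaks) / fs
print(nn_intervals)  # all 1.0 s for this 60 bpm toy signal
```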

Data Splitting and Labeling
Figure 3 shows the different ECG recording durations for each subject under the two different states. Before splitting the data, the dataset should first be balanced to prevent bias towards one class and improve model performance [32]. This study used an undersampling approach instead of a data augmentation or oversampling approach to obtain a balanced dataset, considering the originality of the data and the risk of overfitting in the classification of fatigue [33]. In Figure 3, the shortest ECG recording duration is 23.59 min, taken from the 11th subject in the sleep-good, or alert, state. Thus, the ECG recording durations for each subject under the two different states were taken for up to 23.59 min and used as a duration reference for the balanced dataset.
In machine learning applications, data splitting is generally needed to evaluate the performance of a model. The most prevalent method of data splitting is to divide a dataset into two separate parts. The first part, used to train the model, is known as the training dataset. The rest of the data, also known as the testing dataset or new data, is used to assess the predictive ability of the model. There is no standard data splitting ratio between the training and testing datasets, as it varies between machine learning studies. However, when splitting data, two conditions must be met: the training dataset must be sufficiently large to represent the entire dataset, and the testing dataset must be adequate to evaluate the performance of a model [34]. Most machine learning studies use a ratio of 80% for the training dataset and 20% for the testing dataset [35].
The dataset used in this study consisted of 11 subjects under two different states. Therefore, there are two possible ways to divide the dataset into training and testing datasets: the subject-dependent test and subject-independent test approaches. A subject-dependent test means that a small amount of data is taken from the dataset of each subject and labeled as the testing dataset, while a subject-independent test means that one subject is left out from all the subjects and labeled as the testing dataset. In this study, we preferred to use a subject-dependent test approach, as shown in Figure 4, for two reasons. First, some driving fatigue detection studies [23,28] have reported that a subject-independent test approach produces lower model performance than a subject-dependent approach because the trained model does not learn the characteristics of the test subject's data. Secondly, a subject-independent test approach will produce a different classification model each time the test subject is changed.
Before determining the ratio of the training and testing datasets, the use of the HRV analysis method at the feature extraction phase of the driving fatigue detection framework (Figure 1) needs to be considered. The HRV analysis method [36], used to extract features from NN interval data, suggests a 5 min window of observation for analyzing heart rate variability in the short term, and this 5 min duration is the reference window length in this study. Thus, 5 min of the total ECG recording duration of each subject is the shortest duration needed for the testing dataset, and the remaining recording duration of each subject is assigned to the training data. The ratio of the testing dataset as a percentage can be calculated as follows [3]:

R_te = (T_te / T_to) × 100%, (1)

where T_to is the total ECG recording duration of each subject and T_te is the ECG recording duration of each subject for the testing dataset. Both variables are measured in minutes.
Since the total ECG recording duration reference for the balanced dataset is 23.59 min, the total ECG recording duration for each subject (T_to) was set to 23.59 min. In addition, the minimum ECG recording duration needed for a window of HRV analysis is 5 min, so the ECG recording duration of each subject for the testing dataset (T_te) was set to 5 min. Using Equation (1), the ratio of the testing dataset was found to be approximately 22%. The remaining ECG recording duration was 18.59 min for the training data, resulting in an approximate ratio of 78%. In short, the ratio of the training and testing datasets is 78% to 22% for each subject. Afterwards, all the ECG data were labeled according to the type of dataset along with the subject conditions: the sleep-good label as the alert state, and the sleep-bad label as the fatigued state.
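The split in Equation (1) can be checked numerically. The exact values give roughly a 21-22% testing share and a 78-79% training share, consistent with the approximate 22%/78% ratio reported above.

```python
T_to = 23.59  # total ECG recording duration per subject (min)
T_te = 5.0    # testing-window duration (min)

ratio_test = T_te / T_to * 100           # Equation (1)
ratio_train = (T_to - T_te) / T_to * 100
print(ratio_test, ratio_train)           # ~21.2% and ~78.8%
```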

Data Resampling
A number of studies on driving fatigue detection (Table 1) used the resampling method with a sample called a window or epoch: for example, a 5 min window [3,23,24,28], a 1 min epoch [20], a 3 min window [2], and a 35 s window [21].
According to previous studies [37,38], the diversity of the datasets used in the learning model impacts the model's performance in ensemble learning. The diversity of a dataset can be increased by dividing it into smaller subsets, which is exactly what the resampling method does. Our previous study [3] tested three types of resampling methods with five resampling scenarios on the driving fatigue detection framework and concluded that resampling with the overlapping window method, specifically a resampling window duration of 300 s and an overlap window of 270 s, had the greatest impact on increasing the accuracy of the model's performance. This study uses the same resampling method as in [3] by dividing the entire NN interval dataset into several windows, with two adjacent windows overlapping, as illustrated in Figure 5.
The number of windows in the training dataset using resampling with the overlapping window method can be calculated as follows [3]:

N_tr = ⌊(T_to − T_wte − T_wtr) / (T_wtr − T_o)⌋, (2)

where T_to is the total ECG recording duration of each subject, T_wte is the window duration for each subject for the testing dataset, T_wtr is the window duration for each subject for the training dataset, and T_o is the overlap duration between two adjacent windows. All the variables are measured in seconds. The total ECG recording duration of each subject (T_to) was determined to be 23.59 min, or approximately 1415 s. According to [36], the minimum ECG recording duration needed for a window of HRV analysis is 5 min, so the window duration for each subject for the training and testing datasets (T_wtr and T_wte) was set to 5 min, or 300 s. The overlap window duration (T_o) was set to 270 s. Referring to Equation (2), the number of windows in the training dataset (N_tr) was 27, and the number of windows in the testing dataset (N_te) was 1 for each subject and each state.
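The window count above can be sketched as follows. The floor-based count assumes the final partial stride is discarded, which matches the 27 training windows reported in this study, though boundary conventions for overlapping windows can differ by one.

```python
def n_windows(total_s, window_s, overlap_s):
    # Floor-based count of full overlapping windows (stride = window - overlap).
    stride = window_s - overlap_s
    return (total_s - window_s) // stride

T_to = 1415    # total recording per subject in seconds (~23.59 min)
T_wte = 300    # one 300 s testing window per subject and state
train_s = T_to - T_wte

print(n_windows(train_s, 300, 270))  # 27 training windows per subject/state
```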

Feature Extraction
There are many methods to extract features from ECG data, as stated in a previous review [39], such as HRV analysis in the time domain, the frequency domain, and non-linear approaches [36,40]; statistical features with feature-based information retrieval with a self-similarity matrix [41]; and wavelet features with the wavelet transform [42]. However, the majority of driving fatigue detection studies (Table 1) extracted features from NN interval data using HRV analysis, including those by Huang et al. [24], Kim and Shin [20], Murugan et al. [26], Kundinger et al. [23], Persson et al. [28], and Halomoan et al. [3]. These studies showed that HRV analysis produces high-accuracy fatigue detection. In this study, we applied HRV analysis in the time domain, the frequency domain, and the non-linear approaches for feature extraction. These methods extracted 20 features in the time domain, 9 features in the frequency domain, and 29 features from the non-linear approach, for a total of 58 features. All the feature extraction methods are described in detail in the following sub-sections.

Time Domain Approach
HRV analysis in the time domain has been used in many driving fatigue detection studies, including those by Huang et al. [24], Murugan et al. [26], Kundinger et al. [23], Persson et al. [28], and Halomoan et al. [3]. Moreover, one study [43] showed that the features SDNN, RMSSD, SDSD, pNN50, MeanNN, heart rate (HR), and CVNN are useful for analyzing fatigue states, and another study [44] reported that the logarithm of RMSSD was the most useful feature for identifying fatigue. In this study, we applied statistical analysis and geometrical analysis to extract 20 features, as shown in Table 2. These twenty features were extracted using the same method as in our previous study [3].
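A minimal sketch of a few of the Table 2 time-domain features (MeanNN, SDNN, RMSSD, pNN50, HR), computed here from a short illustrative NN-interval series in milliseconds; the full 20-feature set is not reproduced.

```python
import numpy as np

# Illustrative NN-interval series (ms); real windows hold ~300 s of beats.
nn_ms = np.array([800, 810, 790, 805, 900, 795, 810, 820], dtype=float)

mean_nn = nn_ms.mean()                     # MeanNN (ms)
sdnn = nn_ms.std(ddof=1)                   # SDNN: sample std of NN intervals
diffs = np.diff(nn_ms)
rmssd = np.sqrt(np.mean(diffs ** 2))       # RMSSD: RMS of successive differences
pnn50 = 100 * np.mean(np.abs(diffs) > 50)  # pNN50: % of successive diffs > 50 ms
hr = 60000 / mean_nn                       # mean heart rate (bpm)
print(mean_nn, sdnn, rmssd, pnn50, hr)
```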

Frequency Domain Approach
There are several fatigue studies reporting that HRV analysis in the frequency domain can be used to measure fatigue. One such study [43] shows that the ratio of spectral power in the low-frequency band to spectral power in the high-frequency band, the normalized spectral power in the low-frequency band, and the normalized spectral power in the high-frequency band all change when the subject is fatigued. Another study [44] reported that HRV spectral analysis is more sensitive and informative than time domain HRV indices for monitoring fatigue status. Furthermore, the study presented in [46] reported that at least 10 features extracted from HRV analysis in the frequency domain can be used to distinguish alert and fatigued states. For these reasons, HRV analysis in the frequency domain is needed for measuring driver fatigue. In this study, we applied spectral analysis with the Welch method. We extracted nine features from the NN interval data, as shown in Table 3, using the same method as in our previous study [3].

One study [47] reported that the combination of HRV analysis in the frequency domain and the non-linear approach can distinguish a person's psychological state better than HRV analysis in the frequency domain alone. Our previous study [3] showed that using HRV analysis with non-linear approaches, namely Poincare plot analysis and multifractal detrended fluctuation analysis, improved the performance of specific classification models like random forest and AdaBoost. With these non-linear feature extraction methods added to the driving fatigue detection framework, random forest produced an increase in cross-validated accuracy of 0.33% and testing accuracy of 9.09%, and AdaBoost produced an increase in cross-validated accuracy of 0.84% and testing accuracy of 9.09%. These results motivated us to extract more non-linear features from NN interval data.
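Returning to the frequency-domain step, a minimal sketch of Welch-based band-power extraction is shown below. The resampling rate, band limits, and Welch settings follow common HRV conventions and are assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.signal import welch

def frequency_domain_features(nn_ms, fs=4.0):
    """Welch band powers from an NN-interval series (illustrative sketch).

    The unevenly spaced NN series is resampled onto a uniform 4 Hz grid by
    linear interpolation before spectral estimation.
    """
    nn = np.asarray(nn_ms, dtype=float)
    t = np.cumsum(nn) / 1000.0                     # beat times in seconds
    grid = np.arange(t[0], t[-1], 1.0 / fs)        # uniform time grid
    nn_even = np.interp(grid, t, nn)               # evenly resampled series
    f, pxx = welch(nn_even - nn_even.mean(), fs=fs,
                   nperseg=min(256, len(grid)))

    def band_power(lo, hi):
        m = (f >= lo) & (f < hi)
        return np.trapz(pxx[m], f[m])

    lf = band_power(0.04, 0.15)                    # low-frequency power
    hf = band_power(0.15, 0.40)                    # high-frequency power
    return {"LF": lf, "HF": hf, "LF/HF": lf / hf,
            "LFnu": 100 * lf / (lf + hf), "HFnu": 100 * hf / (lf + hf)}
```

The LF/HF ratio and the normalized LF and HF powers returned here correspond to the fatigue-sensitive frequency-domain indices discussed above.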
In this study, we applied the Poincare plot analysis method and multifractal detrended fluctuation analysis as in our previous study [3]. These methods extracted 25 non-linear features from the NN interval data, as shown in Table 4. In addition, we propose a method for extracting non-linear features from NN interval data that can be used to analyze cognitive status and, thus, classify driving fatigue states: heart rate fragmentation. It was proposed by Costa et al. [7] and has never been used or further investigated in previous driving fatigue studies (Table 1). Heart rate fragmentation is a biomarker of a form of sinoatrial instability characterized by the appearance of inflections in the RR intervals despite the electrocardiogram showing sinus rhythm [52]. It can be used to analyze cardiac interbeat intervals in the autonomic nervous system [53]. Costa et al. [8] reported that heart rate fragmentation can be useful for monitoring cognitive status. We hypothesized that heart rate fragmentation can be used to monitor driver cognitive status because previous studies [4-6] showed that there is an association between driver cognition, sleep loss, and driving fatigue. In addition, the dataset utilized in the proposed driving fatigue detection framework was created in a sleep-related fatigue scenario [29]. Every subject drove under two different sleep conditions on two separate days. To achieve an alert state prior to simulated driving, the subjects were instructed to get at least seven hours of sleep; on another day, to induce a fatigued driving condition, the subjects were instructed to sleep less than seven hours. Therefore, the subjects who had had good sleep had better cognition than the subjects who had experienced sleep deprivation. Heart rate fragmentation extracted four non-linear features from the NN interval data, as shown in Table 5. The total number of features extracted from the non-linear approach is 29, consisting of 7 features from the Poincare plot analysis, 18 features from the multifractal detrended fluctuation analysis, and 4 features from the heart rate fragmentation.
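The four heart rate fragmentation indices (PIP, IALS, PSS, PAS) can be sketched from their published descriptions as follows. The handling of zero differences and the exact segment-length conventions are simplifying assumptions and may differ from Costa et al.'s original definitions.

```python
import numpy as np

def fragmentation_features(nn_ms):
    """Simplified sketch of the four heart rate fragmentation indices."""
    nn = np.asarray(nn_ms, dtype=float)
    d = np.diff(nn)
    s = np.sign(d)                                  # +1 lengthening, -1 shortening

    # Inflection points: sign changes between consecutive NN increments
    # (zero differences are counted as inflections here -- an assumption).
    inflections = s[1:] * s[:-1] <= 0
    pip = 100.0 * np.sum(inflections) / len(d)      # % inflection points

    # Runs of consecutive increments with the same sign.
    change = np.where(s[1:] != s[:-1])[0]
    seg_lengths = np.diff(np.concatenate(([0], change + 1, [len(s)])))
    ials = 1.0 / seg_lengths.mean()                 # inverse average segment length

    # PSS: fraction of increments inside short (< 3) same-sign segments.
    pss = 100.0 * seg_lengths[seg_lengths < 3].sum() / len(s)

    # PAS: fraction of increments inside alternation runs spanning >= 4 NN
    # intervals (i.e. >= 3 consecutive sign-alternating increments).
    alt_lengths = np.diff(np.concatenate(
        ([0], np.where(~inflections)[0] + 1, [len(s)])))
    pas = 100.0 * alt_lengths[alt_lengths >= 3].sum() / len(s)

    return {"PIP": pip, "IALS": ials, "PSS": pss, "PAS": pas}
```

A fully alternating NN series (each beat reversing direction) drives all four indices toward their maxima, whereas a monotone series drives PIP, PSS, and PAS toward zero, matching the intuition of fragmentation as erratic beat-to-beat reversals.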

Feature Selection
Our previous study [3] showed that the use of non-linear feature extraction only improved the performance of certain classification models like random forest and AdaBoost, but not bagging and gradient boosting. This means one or more non-linear features were not relevant to the gradient boosting and bagging models. Such irrelevant features are referred to as redundant features, and they may represent more noise than valuable data [54]. This motivated us to use the feature selection method in the proposed driving fatigue detection framework (Figure 1). In Table 1, most studies applied feature selection to select the relevant features to improve the model's performance. Each study's feature selection method offers its own advantages; however, we chose mutual information over other feature selection methods because our study used linear and non-linear feature extraction methods, and mutual information can capture both linear and non-linear relationships between variables. In addition, mutual information can measure the relevance of features to the target variable [9].
Mutual information (MI) is a measure of the dependence between two random variables and is interpreted as the amount of information gained about one random variable by observing the other [55]. Mutual information can be used to measure the similarity between two random variables: high mutual information means the variables are similar, indicating a large reduction in uncertainty; low mutual information means they are dissimilar, indicating a small reduction in uncertainty; and zero mutual information means the variables are independent of each other [56]. The mutual information of two random variables, X and Y, whose joint probability distribution is defined by P_XY(x, y), can be calculated as follows [57]:

I(X; Y) = Σ_{x,y} P_XY(x, y) log [ P_XY(x, y) / (P_X(x) P_Y(y)) ] = E_{P_XY} [ log (P_XY / (P_X P_Y)) ]

where P_X(x) and P_Y(y) are the marginal probability distributions of X and Y. Mutual information is used to assess the relevance of features to the target in a dataset by calculating the mutual information between each feature and the target variable.
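In practice, MI-based feature relevance can be estimated with scikit-learn's `mutual_info_classif`. The sketch below uses synthetic data in place of the HRV feature matrix.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in: 58 columns play the role of the HRV features, y the
# alert/fatigued label; column 0 is made informative about the label.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=500)
X = rng.normal(size=(500, 58))
X[:, 0] += 2.0 * y

mi = mutual_info_classif(X, y, random_state=0)   # MI of each feature vs. target
ranking = np.argsort(mi)[::-1]                   # highest MI first
top_k = ranking[:44]                             # keep the 44 best-ranked
X_selected = X[:, top_k]

print(ranking[0])   # the informative column should rank first
```

Note that `mutual_info_classif` uses a nearest-neighbor estimator of MI for continuous features, so scores are non-negative estimates rather than exact values.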
Hyperparameter Optimization and Classification Model

Hyperparameter Optimization
Before optimizing the hyperparameters, the cross-validation method needs to be applied in the proposed driving fatigue detection framework to reduce the possibility of high variance or bias in the performance of the learning models [58]. The k-fold cross-validation method was used on the training data after the feature extraction was completed. In this study, we chose 10-fold cross-validation, dividing the dataset into 10 parts of the same size. One part out of the ten serves as the validation subset, while the other nine parts serve as the subset for training the model. As the dataset is divided into 10 parts, there are 10 possible validation iterations, resulting in 10 validation accuracy results. The average of all the validation accuracy results is called the cross-validated accuracy.
In machine learning applications, hyperparameter optimization is crucial for constructing the most effective model architecture with optimal hyperparameter configurations, and optimized hyperparameters can substantially boost model performance [59]. In this study, we chose the grid search strategy for hyperparameter optimization, which is easy to implement and can be run in parallel. The grid search strategy is carried out by manually setting up a grid of possible values and trying out all possible combinations of the hyperparameter values during model training [59]. The hyperparameter grid values used for hyperparameter optimization are shown in Table 6. The best model with optimized hyperparameter values yields the highest validation accuracy; the final model is later evaluated with the testing dataset.
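A grid search with 10-fold cross-validation, as described above, can be sketched with scikit-learn's `GridSearchCV`. The grid values and synthetic data here are placeholders, not the paper's Table 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the HRV features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Placeholder grid (not the paper's Table 6): every combination is tried,
# each scored by 10-fold cross-validated accuracy.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)               # hyperparameters of the best model
print(round(search.best_score_, 4))      # its cross-validated accuracy
```

Because every grid point is evaluated independently, the search parallelizes trivially (e.g. via the `n_jobs` parameter), which is the "run in parallel" property noted above.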

Classification Model
In our previous study [3], as shown in Table 1, AdaBoost produced a high cross-validated accuracy of 98.82%, and random forest produced a high cross-validated accuracy of 97.98%. The proposed driving fatigue detection framework shown in Figure 1 uses the ensemble learning method, as in our previous study [3], to classify the two fatigue states. To evaluate the proposed framework, four ensemble learning models were applied: AdaBoost, bagging, gradient boosting, and random forest. Further details regarding these four ensemble learning models can be found in [60].
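The four ensemble learners can be instantiated and cross-validated with scikit-learn as follows. Default settings and synthetic data are used here for illustration; the actual hyperparameters come from the grid search described earlier.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# The four ensemble learners compared in the framework.
models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Synthetic two-class data standing in for the selected HRV features.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = (X[:, 0] > 0).astype(int)

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=10).mean()  # 10-fold CV accuracy
    print(f"{name}: {acc:.3f}")
```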

Results and Discussion
In this paper, we propose improvements to the feature extraction phase, by adding heart rate fragmentation, and to the feature selection phase, using mutual information, in the driving fatigue detection framework (Figure 1). All the results and effects of the proposed improvements on model performance are presented and discussed in this section. Two sets of experiments are presented: experiments without feature selection and experiments with feature selection. The first set consists of two experiments applied in the proposed driving fatigue detection framework (Figure 1) without feature selection: an experiment using the 58 features of the proposed study and an experiment using the 54 features of the previous study. Table 7 describes these experiments, which are used to evaluate the effect of the 58 features, with heart rate fragmentation included, on model performance without feature selection. To support a more comprehensive feature analysis, we provide a description of the distribution of the 58 features extracted from the training dataset in Appendix A. The second set of experiments is used to analyze the effect of the number of selected features on the model's performance. Since there is no ideal method to select the best number of features, we used Chen et al.'s approach [61] of finding the optimal number of features based on mutual information. The method is based on observing model performance as a function of the number of selected features: the more relevant features added to the model, the higher the model's performance; the more irrelevant features added, the lower the model's performance.
The optimal number of features will be found at a turning point. With the number of selected features denoted as S, all the features denoted as F, and the optimal number of features denoted as h, the procedure to select a subset of features from the full feature space is as follows:
1. Conduct a feature ranking of F based on the mutual information score between each feature and the target variable. All the features are ranked from the highest mutual information score to the lowest. A higher mutual information score for a feature means that the feature and target variable have a dependency; thus, the feature carries more useful information for classification [56]. The result of feature ranking is the sequence of ranked features, denoted as F_1.
2. Train the model with decreasing numbers of selected features, from the largest number to the smallest, and assess each model with the testing dataset. In this paper, we suggest seven experiments, selecting 87.5%, 75%, 62.5%, 50%, 37.5%, 25%, and 12.5% of all the ranked features (F_1).
3. Plot the model performance (accuracy) against the number of selected features (S) and observe the results. The optimal number of features h will be found at the turning point where the highest testing accuracy of a classification model is plotted.
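The three-step procedure above can be sketched end to end. The ceiling rounding of the feature counts (giving 51, 44, 37, 29, 22, 15, and 8 features out of 58) is an assumption chosen to match the counts reported in the results, and the data are synthetic stand-ins for the HRV feature matrices.

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score

# Synthetic stand-ins for the HRV feature matrices and fatigue labels.
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(300, 58)); y_tr = rng.integers(0, 2, 300)
X_te = rng.normal(size=(60, 58));  y_te = rng.integers(0, 2, 60)

# Step 1: rank all features by mutual information with the target.
ranking = np.argsort(mutual_info_classif(X_tr, y_tr, random_state=0))[::-1]

# Step 2: train on the top S features for each fraction and test.
results = {}
for frac in [0.875, 0.75, 0.625, 0.50, 0.375, 0.25, 0.125]:
    s = math.ceil(frac * X_tr.shape[1])     # number of selected features S
    cols = ranking[:s]
    model = RandomForestClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    results[s] = accuracy_score(y_te, model.predict(X_te[:, cols]))

# Step 3: the turning point is where testing accuracy peaks.
best_s = max(results, key=results.get)
print(sorted(results), best_s)
```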
The seven experiments selecting 87.5%, 75%, 62.5%, 50%, 37.5%, 25%, and 12.5% of all the ranked features applied in the proposed driving fatigue detection framework (Figure 1) are further described in Table 8. This section is organized into three subsections. The first subsection analyzes the effect of feature selection on model performance. The second subsection analyzes the necessity of non-linear features in the proposed driving fatigue framework. The last subsection discusses comparisons of the model's performance in the proposed study and the previous study.

The Performance of Each Classification Model without Feature Selection
Figure 6 shows the results of the experiments in Table 7, which compares the performance of each classification model in the proposed study and the previous study without using feature selection. The difference between the proposed study and the previous study [3] is that heart rate fragmentation is one of the non-linear extraction methods applied in the proposed driving fatigue detection framework (Figure 1).
Including the 58 features with heart rate fragmentation increased both the cross-validated accuracy and the testing accuracy of the gradient boosting model: cross-validated accuracy rose by 0.5%, from 98.66% with 54 features to 99.16% with 58 features, and testing accuracy rose by 9.09%, from 72.73% with 54 features to 81.82% with 58 features. The 58 features had no effect on the AdaBoost model's performance.
For the random forest model, the 58 features increased the cross-validated accuracy by 0.44%, from 97.88% with 54 features to 98.32% with 58 features; however, testing accuracy decreased by 9.09%, from 86.36% with 54 features to 77.27% with 58 features. This indicates that the random forest model with 58 features is prone to overfitting because the model is too complex. The complexity of a random forest depends on the number of trees, the depth of the trees, the number of samples, and the number of features [62]. The random forest with 58 features may achieve higher cross-validated accuracy, but the added complexity requires higher processing costs and increases the risk of overfitting.
The 58 features had no effect on the cross-validated accuracy of the bagging model; however, its testing accuracy increased by 13.63%, from 63.64% with 54 features to 77.27% with 58 features. In short, the 58 features with heart rate fragmentation included have a significant effect only on the gradient boosting model's performance. This is possible because the gradient boosting model can capture complex variable interactions in data and performs implicit feature selection, assigning higher importance to features that contribute more to minimizing the loss function during the boosting iterations. Consequently, as many trees are added to the model, the training error of the fitted model may become arbitrarily small, which can lead to poor generalization [63,64].
Since there are 58 features extracted from the NN interval data, the classification model becomes more complex and the dimensionality increases, which can lead to less accurate predictions [65]. This is why the feature selection method is needed in the driving fatigue detection framework.

The Performance of Each Classification Model with Feature Selection
The results of the feature selection experiments in Table 8 are shown in Figure 7, which is used to find the optimal number of features with Chen et al.'s approach [61]. It shows the cross-validated accuracy and testing accuracy of each classification model for various numbers of selected features. Table 9 provides detailed information on each classification model's performance with the various selected features.
The number of selected features has a major influence on the performance of the classification model. For example, the AdaBoost model yields a testing accuracy of 68.18% and a cross-validated accuracy of 97.81% with eight selected features (Table 9). The 29.63% gap between the cross-validated accuracy and the testing accuracy indicates that the model has high variance and fails to generalize to testing or unseen data. This happens because of insufficient information or because the selected features are redundant rather than useful, making the model prone to overfitting. In Figure 7, the highest testing accuracy of the AdaBoost model is 86.36%, plotted at four peak points with different cross-validated accuracies, so it is difficult to find the optimal number of features from testing accuracy alone. We therefore also consider the cross-validated accuracy and model complexity to determine the optimal number of selected features for AdaBoost. The highest cross-validated accuracy of AdaBoost is 98.99%, so the optimal number of features is 37.
The highest testing accuracy of the bagging model is 86.36%, plotted at three peak points with different cross-validated accuracies (Figure 7). Using the same approach as for AdaBoost, the highest cross-validated accuracy of the bagging model is 98.32%, so the optimal number of features is 29.
The highest testing accuracy of the gradient boosting model is 86.36%, plotted at 22 features in Figure 7. The highest cross-validated accuracy of gradient boosting is 99.66%, also plotted at 22 features. The optimal number of features for gradient boosting is therefore 22.
Of the four ensemble models, the random forest model gave the most accurate testing results of 95.45% and a cross-validated accuracy of 98.65%, with an optimal number of features of 44. This shows that the combination of the heart rate fragmentation method at the feature extraction phase and the mutual information method at the feature selection phase makes a very significant contribution to the performance of the random forest model.

The Necessity of Non-Linear Features in the Proposed Driving Fatigue Framework
In Figure 7, it is clear that the random forest model yields the highest testing accuracy of 95.45% and a cross-validated accuracy of 98.65% with an optimal number of features of 44. In this section, we analyze the importance of the non-linear features among the 44 selected features in the proposed driving fatigue detection framework (Appendix B). Figure 8 shows the top 44 ranked features out of 58, visualized as packed bubbles; a larger bubble represents a higher mutual information score.
The features obtained from the HRV analysis in the time domain account for the majority of the large bubbles in Figure 8. This means that the features extracted by the HRV analysis in the time domain dominate the high mutual information scores, with the non-linear features forming the second-largest group of large bubbles. Thus, the features extracted in the time domain have higher dependencies on the target variable and are more relevant than the non-linear features; however, the non-linear features have higher dependencies than the features extracted in the frequency domain, as shown by the frequency domain features occupying the lower ranks of 35 to 42 (Figure 9). Based on the categorization of the HRV analysis approaches shown in Table 10, all 20 features extracted in the time domain are used for classification, 20 out of 29 features from the non-linear approach are used, and 4 out of 9 features from the frequency domain are used. This suggests that the features extracted in the time domain play an important role as more useful features compared to the features extracted with the non-linear and frequency domain approaches. This also aligns with the study presented in [23], which produced a high cross-validated accuracy of 97.37% (Table 1). That study extracted 26 features, consisting of 13 features in the time domain, 10 features in the frequency domain, and 3 non-linear features. With the correlation-based feature subset selection method, the study selected two features out of 26, namely, MaxHR and MinHR. MaxHR corresponds to MinNN, ranked no. 3, and MinHR corresponds to MaxNN, ranked no. 1 in our proposed study (Figure 9).
The performance of a classification model depends on the model complexity. The complexity of a random forest depends on the number of trees, the depth of the trees, the number of samples, and the number of features [62]. With 44 selected features, the features extracted with the non-linear approach and those extracted in the time domain play the same role in the complexity of the random forest model, because the numbers of selected features from the time domain and from the non-linear approach are the same (20 out of 44 features each).
The feature rankings for the non-linear features extracted by heart rate fragmentation are as follows (Figure 9): IALS at rank 13, PIP at rank 20, PSS at rank 23, and PAS at rank 26. This shows that the features extracted using heart rate fragmentation fall within the top 26 features in terms of importance, meaning these features are relevant to the target variable and provide useful information for classification. In addition, the heart rate fragmentation method can be effectively used to classify fatigue based on cognitive status.

Comparison of the Performance of Each Classification Model in the Proposed Study and the Previous Study and Model Selection
Figure 10 shows a comparison of the performance of the four classification models in our proposed study and our previous study [3]. Overall, our proposed study improves the performance of all the classification models compared to our previous study [3] by adding the non-linear feature extraction of heart rate fragmentation in the feature extraction stage and mutual information for feature selection in the driving fatigue detection framework. For example, in our previous study [3], we reported that the AdaBoost model with 54 features produced the most optimal cross-validated accuracy of 98.82% and a testing accuracy of 81.82%. In our proposed study, the AdaBoost model with 37 selected features produced a cross-validated accuracy of 98.99% and a testing accuracy of 86.36%.
In our previous study [3], we used cross-validation as the main parameter for model selection to estimate generalization errors on the training dataset. As a result, AdaBoost was chosen as the most optimal of the four ensemble learning models because it had the highest cross-validated accuracy at 98.82%. However, the study presented in [66] reported that cross-validation is no longer an effective estimate of generalization for model selection, showing empirically that the risk of overfitting increases as the model becomes larger. This is why we consider the testing accuracy, along with the confusion matrix metric, which better represents the true model performance, for model selection. Figure 10 shows that the random forest model with 44 selected features gave the highest testing accuracy of 95.45% with a cross-validated accuracy of 98.65%. In addition, the confusion matrix of the random forest model (Figure 11d) shows that it correctly classifies all of the fatigued state data and misclassifies one alert state sample as fatigued. In summary, the random forest model has the best generalization among the four ensemble learning models and was chosen as the best classification model in this proposed study.
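The confusion-matrix reading described above can be reproduced numerically. The class sizes (11 testing windows per state) are an assumption chosen so that one misclassified alert window yields the reported 95.45% testing accuracy.

```python
from sklearn.metrics import confusion_matrix

# Label vectors mirroring the reported outcome: all fatigued windows
# classified correctly, one alert window misclassified as fatigued.
y_true = ["alert"] * 11 + ["fatigued"] * 11
y_pred = ["alert"] * 10 + ["fatigued"] + ["fatigued"] * 11

cm = confusion_matrix(y_true, y_pred, labels=["alert", "fatigued"])
print(cm)                                 # rows: true class; columns: predicted
accuracy = cm.trace() / cm.sum()
print(round(accuracy * 100, 2))           # 95.45
```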

Future Directions
The proposed driving fatigue detection framework has shown remarkable accuracy compared to prior studies in driving fatigue detection (Table 1); however, the proposed method was tested using a simulated driving dataset with a sleep-related fatigue scenario. The performance of our proposed method requires further evaluation with a larger number of subjects in real-world driving scenarios.
When conducting real-world driving with sleep-related fatigue scenarios, driver safety must be considered carefully, as sleep deprivation can lead to microsleep episodes in drivers [67], potentially resulting in accidents. The following considerations can be taken into account in real-world driving scenarios:
1. Utilizing a co-driver: having a co-driver accompany the main driver during real-world driving can enhance safety by providing assistance and monitoring, thereby minimizing the risk of accidents.
2. Driving in monitored and controlled environments: conducting real-world driving scenarios in a controlled and monitored environment can help mitigate accident risks.
3. Objective and periodic fatigue assessment: implementing objective and periodic fatigue assessments, overseen by experts in the scientific study of sleep, is crucial to monitoring drivers' conditions during the experiments.

Our study utilized a dataset [29] specifically designed to distinguish between two states: alertness and fatigue. The dataset focuses on fatigue related to sleep deprivation. Sleep-deprived individuals tend to be more susceptible to attention lapses, microsleep episodes, and slower reaction times [67]. These indications show decreased alertness, which substantially elevates the risk of accidents and traffic incidents. Consequently, future work needs to consider the driver's level of alertness, which can be evaluated in a periodic fatigue assessment using subjective and/or objective measures. Subjective fatigue measurement can be conducted using questionnaires, such as the Chalder Fatigue Scale [68], while objective fatigue measurement can be assessed by specialists in the field of sleep medicine. These measurements should be performed at specific intervals and periodically during the driving experiments. The results of these measurements represent the driver's degree of alertness and can serve as ground-truth data for training or validating the model. Therefore, the system can detect a driver's fatigue at a certain level and issue warnings accordingly.

Conclusions
Most of the features extracted in the time domain have high mutual information scores. This means that, compared to the non-linear features, the time domain features are more dependent on the target variable and more relevant to it. Nevertheless, the inclusion of non-linear features contributes to the model's complexity and subsequently impacts the performance of a classification model. It has been demonstrated that the inclusion of the four non-linear features of heart rate fragmentation in the feature extraction stage leads to improved model performance. For instance, in the bagging model without feature selection, there was a notable increase in testing accuracy of 13.63%, rising from 63.64% with 54 features to 77.27% with 58 features. Similarly, the gradient boosting model without feature selection exhibited an increase in testing accuracy of 9.09%, rising from 72.73% with 54 features to 81.82% with 58 features.
The number of selected features has a major influence on the performance of the classification model. Too few selected features cause the model to fail to generalize to testing or unseen data because of insufficient information; too many selected features make the model more complex, which can lead to overfitting. A method for finding the optimal number of features is therefore needed to obtain the best model performance.
Our proposed study found that the random forest model, with an optimally selected number of 44 features, produced the most accurate testing results at 95.45%, along with a cross-validated accuracy of 98.65%. This outcome surpasses the findings of our previous study [3], in which the AdaBoost model, employing 54 features, achieved a testing accuracy of 81.82% along with a cross-validated accuracy of 98.82%. The results from our proposed study demonstrate that the inclusion of the heart rate fragmentation method for feature extraction and the mutual information method for feature selection significantly enhances the performance of the random forest model.

Figures A1-A3 visualize the distribution of features extracted using the time domain approach. Figure A4 visualizes the distribution of features extracted using the frequency domain approach. Figure A5 visualizes the distribution of features extracted using the non-linear approach of Poincare plot analysis. Figures A6-A8 visualize the distribution of features extracted using the non-linear approach of multifractal detrended fluctuation analysis. Figure A9 visualizes the distribution of features extracted using the non-linear approach of heart rate fragmentation. Figures A10 and A11 visualize the distribution of features extracted using the time domain approach. Figure A12 visualizes the distribution of features extracted using the frequency domain approach. Figure A13 visualizes the distribution of features extracted using the non-linear approach of Poincare plot analysis. Figure A14 visualizes the distribution of features extracted using the non-linear approach of multifractal detrended fluctuation analysis. Figure A15 visualizes the distribution of features extracted using the non-linear approach of heart rate fragmentation. All the feature distributions are presented as histogram plots, with a series of bins (intervals) on the horizontal axis and the frequency of data points falling into each bin shown as bars on the vertical axis.
A previous study showed that it is important to first identify the causal factors of fatigue before deciding which measurement methods to use to detect driving fatigue. It presented two types of causal factors for driving fatigue: sleep-related fatigue and task-related fatigue. Sleep-related fatigue is caused by the circadian rhythm effect, prolonged wakefulness, and sleep deprivation. Task-related fatigue results from the length and task demands of driving and is unrelated to any sleep-related causes. The type of fatigue studied in this paper is sleep-related fatigue, specifically fatigue caused by sleep deprivation.
Figure 3 illustrates the ECG recording time (measured in minutes) for each subject in two different driving fatigue states: alert and fatigued.

Figure 3. ECG recording durations for each subject in two different driving conditions: sleep-good, denoting the alert state, and sleep-bad, denoting the fatigued state.

Figure 4. The illustration of data splitting with a subject-dependent test approach and data labeling with sleep-good and sleep-bad.

Figure 5. The illustration of resampling with the overlapping windows method.
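The overlapping-window resampling illustrated in Figure 5 can be sketched as a sliding slice over the recorded sequence. The window length and 50% overlap below are illustrative assumptions, not the study's exact parameters.

```python
# A minimal sketch of overlapping-window resampling: slice a sequence into
# fixed-length windows that overlap by a given fraction.
import numpy as np

def overlapping_windows(signal, window_size, overlap=0.5):
    """Return consecutive windows of `window_size` samples, each shifted by
    window_size * (1 - overlap) samples from the previous one."""
    step = int(window_size * (1 - overlap))
    return [signal[start:start + window_size]
            for start in range(0, len(signal) - window_size + 1, step)]

signal = np.arange(100)  # stand-in for a recorded RR-interval or ECG sequence
windows = overlapping_windows(signal, window_size=40, overlap=0.5)
# step = 20, so windows start at samples 0, 20, 40, and 60
```

Overlapping windows multiply the number of training records extracted from a fixed-length recording, at the cost of correlation between adjacent windows.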

4.1.
Figure 6 shows the results of the experiments in Table 7, which compares the performance of each classification model in the proposed study and the previous study without using feature selection. The difference between the proposed study and the previous study [3] is that heart rate fragmentation is one of the non-linear extraction methods applied in the proposed driving fatigue detection framework (Figure 1). Including heart rate fragmentation, for 58 features in total, increases both the cross-validated accuracy and the testing accuracy of the gradient boosting model: cross-validated accuracy rose by 0.5%, from 98.66% with 54 features to 99.16% with 58 features, and testing accuracy rose by 9.09%, from 72.73% with 54 features to 81.82% with 58 features. The 58 features have no effect on the AdaBoost model's performance. For the random forest model, the 58 features increase cross-validated accuracy by 0.44%, from 97.88% with 54 features to 98.32% with 58 features; however, testing accuracy decreases by 9.09%, from 86.36% with 54 features to 77.27% with 58 features. This indicates that the random forest model with 58 features is prone to overfitting because the model is too complex. The complexity of a random forest depends on the number of trees, the depth of the trees, the number of samples, and the number of features [62]. A random forest with 58 features may achieve higher cross-validated accuracy, but the added complexity raises processing costs and increases the risk of overfitting. Finally, the 58 features have no effect on the bagging model's cross-validated accuracy, but its testing accuracy increases by 13.63%, from 63.64% with 54 features to 77.27% with 58 features.
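The four-model comparison behind Figure 6 can be sketched as below. The synthetic feature matrix and default hyperparameters are stand-ins; the study tuned its models with a grid search (Table 6).

```python
# A hedged sketch of comparing the four ensemble learners on one feature
# matrix, reporting 10-fold cross-validated and held-out accuracy for each.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=594, n_features=58, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}
results = {}
for name, clf in models.items():
    cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()
    test_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    results[name] = (cv_acc, test_acc)
```

Reporting both scores side by side makes the overfitting pattern discussed above visible: a large gap between cross-validated and testing accuracy flags an overly complex model.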


Figure 6. The cross-validated accuracy and testing accuracy of each classification model without feature selection.

Figure 7. The cross-validated accuracy and testing accuracy of each classification model with various numbers of selected features.

Table 9. Detailed information on the cross-validated accuracy and testing accuracy of each classification model without feature selection and with feature selection.

Information 2023, 14, x FOR PEER REVIEW
The ranks occupied by features extracted in the frequency domain are from 35 to 42, which are lower ranks (Figure 9).

Figure 8. The top 44 ranked features visualized as packed bubbles.

Figure 9. The top 44 ranked features with the mutual information scores.

Figure 10. Comparison of cross-validated accuracy and testing accuracy of AdaBoost, bagging, gradient boosting, and random forest in the previous study with 54 features and the proposed study.

Figure 11. Confusion matrix of the testing dataset for all classification models with selected features: (a) AdaBoost with 37 selected features; (b) bagging with 29 selected features; (c) gradient boosting with 22 selected features; (d) random forest with 44 selected features.
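Each panel of Figure 11 can be produced by tabulating predicted against true labels on the testing split. The labels and predictions below are toy stand-ins for one model's output, not the study's actual results.

```python
# A minimal sketch of building one confusion matrix for the two driving states.
from sklearn.metrics import confusion_matrix

y_true = ["alert", "alert", "fatigued", "fatigued", "fatigued", "alert"]
y_pred = ["alert", "fatigued", "fatigued", "fatigued", "fatigued", "alert"]

cm = confusion_matrix(y_true, y_pred, labels=["alert", "fatigued"])
# cm[0, 0]: alert records classified as alert
# cm[0, 1]: alert records misclassified as fatigued
# cm[1, 1]: fatigued records classified as fatigued
```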

Figure A1. The distribution of MeanNN, SDNN, SDSD, SDANN, SDNNI, RMSSD, and CVNN features that are extracted using a time-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.
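A few of the Figure A1 time-domain features can be computed directly from a series of NN (normal-to-normal) intervals, following the standard HRV definitions. The interval values below are illustrative.

```python
# A hedged sketch of MeanNN, SDNN, RMSSD, and pNN50 from NN intervals (ms).
import numpy as np

nn = np.array([810, 790, 825, 800, 760, 880, 805, 795], dtype=float)  # ms

mean_nn = nn.mean()                        # MeanNN: average NN interval
sdnn = nn.std(ddof=1)                      # SDNN: std of NN intervals
diff = np.diff(nn)                         # successive differences
rmssd = np.sqrt(np.mean(diff ** 2))        # RMSSD: root mean square of diffs
pnn50 = np.mean(np.abs(diff) > 50) * 100   # pNN50: % of |diffs| above 50 ms
```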

Figure A2. The distribution of CVSD, MedianNN, MadNN, MCVNN, IQRNN, Prc20NN, and Prc80NN features that are extracted using a time-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.

Figure A3. The distribution of pNN50, pNN20, MinNN, MaxNN, TINN, and HTI features that are extracted using a time-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.

Figure A4. The distribution of VLF, LF, HF, VHF, LFHF, LFn, HFn, and LnHF features that are extracted using a frequency-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.
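The Figure A4 band powers can be sketched by estimating the spectrum of an evenly resampled NN series with Welch's method and integrating the standard bands. The band edges follow common HRV conventions, and the signal below is synthetic, so this is an assumption about the general technique rather than the study's exact procedure.

```python
# A hedged sketch of LF, HF, and the LF/HF ratio from a resampled NN series.
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

fs = 4.0                                   # Hz, resampling rate of the NN series
t = np.arange(0, 300, 1 / fs)
# Synthetic NN series with a 0.1 Hz (LF) and a 0.25 Hz (HF) component.
nn = 800 + 20 * np.sin(2 * np.pi * 0.1 * t) + 10 * np.sin(2 * np.pi * 0.25 * t)

freqs, psd = welch(nn - nn.mean(), fs=fs, nperseg=256)

def band_power(freqs, psd, lo, hi):
    mask = (freqs >= lo) & (freqs < hi)
    return trapezoid(psd[mask], freqs[mask])

lf = band_power(freqs, psd, 0.04, 0.15)    # low-frequency power
hf = band_power(freqs, psd, 0.15, 0.40)    # high-frequency power
lf_hf = lf / hf                            # LF/HF ratio
```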

Figure A5. The distribution of SD1, SD2, SD1/SD2, S, CSI, CVI, and modified CSI features that are extracted using a non-linear approach: Poincare plot analysis. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.
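The core Poincare descriptors in Figure A5 can be computed from the NN series via the usual variance identities: SD1 measures the spread perpendicular to the identity line (short-term variability) and SD2 the spread along it (long-term variability). The interval values below are illustrative.

```python
# A minimal sketch of SD1, SD2, their ratio, and the ellipse area S.
import numpy as np

nn = np.array([810, 790, 825, 800, 760, 880, 805, 795], dtype=float)  # ms
diff = np.diff(nn)

sd1 = np.sqrt(np.var(diff, ddof=1) / 2)                    # short-term spread
sd2 = np.sqrt(2 * np.var(nn, ddof=1) - np.var(diff, ddof=1) / 2)  # long-term
sd_ratio = sd1 / sd2                                       # SD1/SD2
s = np.pi * sd1 * sd2                                      # S: ellipse area
```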

Figure A6. The distribution of DFA α1, MDFA α1-width, MDFA α1-peak, MDFA α1-mean, MDFA α1-max, and MDFA α1-delta features that are extracted using a non-linear approach: multifractal detrended fluctuation analysis (MFDFA). The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.

Figure A7. The distribution of MDFA α1-asymmetry, MDFA α1-fluctuation, MDFA α1-increment, DFA α2, MDFA α2-width, and MDFA α2-peak features that are extracted using a non-linear approach: multifractal detrended fluctuation analysis (MFDFA). The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.

Figure A8. The distribution of MDFA α2-mean, MDFA α2-max, MDFA α2-asymmetry, MDFA α2-fluctuation, and MDFA α2-increment features that are extracted using a non-linear approach: multifractal detrended fluctuation analysis (MFDFA). The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.

Figure A9. The distribution of PIP, IALS, PSS, and PAS features that are extracted using a non-linear approach: heart rate fragmentation. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 297 for the alert state and 297 for the fatigued state.
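Two of the Figure A9 fragmentation metrics can be sketched from their standard definitions: PIP is the percentage of inflection points, where the NN increment changes sign, and IALS is the inverse of the average acceleration/deceleration segment length. Dropping zero increments is a simplification assumed here, not necessarily the study's exact handling.

```python
# A hedged sketch of PIP and IALS from an NN-interval series (ms).
import numpy as np

def fragmentation(nn):
    d = np.sign(np.diff(nn))
    d = d[d != 0]                          # drop zero increments (simplification)
    changes = np.sum(d[1:] != d[:-1])      # sign changes = inflection points
    pip = 100 * changes / len(d)           # PIP: % of increments at inflections
    ials = (changes + 1) / len(d)          # IALS: segments per increment
    return pip, ials

nn = np.array([800, 810, 805, 815, 820, 810, 805, 815], dtype=float)
pip, ials = fragmentation(nn)
```

Higher PIP and IALS indicate a more fragmented heart rhythm, which is why these features carry information about the driver's state beyond conventional HRV measures.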

Figure A11. The distribution of IQRNN, SDNN, CVNN, pNN20, pNN50, SDNNI, MadNN, SDANN, HTI, and MCVNN features that are extracted using a time-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 11 for the alert state and 11 for the fatigued state.

Figure A12. The distribution of VHF, HFn, LFn, and LFHF features that are extracted using a frequency-domain approach. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 11 for the alert state and 11 for the fatigued state.

Figure A13. The distribution of SD1, modified CSI, S, CVI, SD2, CSI, and SD1/SD2 features that are extracted using a non-linear approach: Poincare plot analysis. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 11 for the alert state and 11 for the fatigued state.

Figure A14. The distribution of MDFA α2-peak, MDFA α2-width, MDFA α1-mean, MDFA α2-fluctuation, MDFA α2-delta, MDFA α2-max, MDFA α2-mean, MDFA α1-max, and MDFA α2-increment features that are extracted using a non-linear approach: multifractal detrended fluctuation analysis (MFDFA). The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 11 for the alert state and 11 for the fatigued state.

Figure A15. The distribution of IALS, PIP, PSS, and PAS features that are extracted using a non-linear approach: heart rate fragmentation. The horizontal axis represents the range of feature values, while the vertical axis illustrates the number of records, providing insight into the distribution of feature values across both alert and fatigued states. Each feature has a number of records: 11 for the alert state and 11 for the fatigued state.

Table 1. A detailed review of studies on fatigue or drowsy driving detection using ECG from 2018 to 2023. An asterisk (*) denotes the cross-validated accuracy; a double asterisk (**) denotes the testing accuracy; and a triple asterisk (***) denotes the training accuracy.

Table 2. Extracted features using HRV analysis in the time domain.

Table 3. Extracted features using HRV analysis in the frequency domain.

Table 4. Extracted features using HRV analysis in a non-linear approach: Poincare plot analysis and multifractal detrended fluctuation analysis.

Table 5. Extracted features using HRV analysis in a non-linear approach: heart rate fragmentation.

Table 6. The hyperparameter values for the grid search strategy.

Table 7. The summary of all experiments applied to the proposed driving fatigue detection framework in comparison to our previous study without feature selection. * PPA stands for Poincare plot analysis; MFDFA stands for multifractal detrended fluctuation analysis; HRF stands for heart rate fragmentation.

Table 8. The summary of all experiments applied to the proposed driving fatigue detection framework with feature selection.

Table 9. Detailed information on the cross-validated accuracy and testing accuracy of each classification model without feature selection and with feature selection.

Table 10. The categorization of 58 features (without feature selection) and 44 features (with feature selection) based on the HRV analysis approach. PPA stands for Poincare plot analysis; MFDFA stands for multifractal detrended fluctuation analysis; HRF stands for heart rate fragmentation.