1. Introduction
Road accidents are among the most frequent and devastating occurrences that can lead to loss of life and financial damage [1]. Several factors contribute to road accidents, such as the drivers’ mental and physical states, technical malfunctions of equipment, the influence of drugs, driver fatigue and drowsiness, and other related factors [2]. Drowsiness is the transitional state between wakefulness and sleep, characterized by decreased alertness and a reduced ability to make quick decisions [3]. This is particularly dangerous for drivers: statistics from around the world show that driver fatigue is a leading cause of traffic accidents, and car companies have invested millions of dollars in developing warning systems to detect drowsiness. As the number of on-road vehicles continues to increase, the problem of drowsy driving is becoming more prevalent [4]. It is therefore crucial to identify signs of driver drowsiness and to develop intelligent drowsiness detection systems that prevent and reduce accidents. Accurate drowsiness detection is a primary objective in the advancement of novel driver-assistance systems and advanced detection methods. Drowsiness can be assessed in three ways: using behavioral, vehicle-based, and physiological criteria [5].
Contemporary driver-assistance platforms, including those from Tesla, Volvo, and Mercedes-Benz, incorporate camera-based fatigue monitoring to detect distraction and drowsiness. Despite their practicality, such systems are vulnerable to environmental variation (lighting, viewpoint, occlusion) and may pose privacy concerns. Physiological sensing, particularly EEG, offers a complementary pathway by capturing neural correlates of drowsiness that are less dependent on external conditions. Accordingly, the proposed framework is designed to augment existing in-vehicle monitoring solutions.
The EEG data analyzed here originate from the MIT-BIH Polysomnography Database collected in controlled clinical sleep laboratories, not driving settings. As such, this study is a methodological investigation to develop and evaluate an EEG-based classifier that can subsequently be adapted and tested in more ecologically valid driving scenarios.
One way to evaluate drowsiness in drivers is by observing their behavior, such as yawning, head movements, blink rate, and duration of eyelid closure. Using a camera to capture images of drivers’ faces is noninvasive and convenient, but it can be affected by environmental factors such as lighting, camera movement, and viewing angle. Moreover, drivers may be uncomfortable with direct camera monitoring, and configuring the camera requires fine-tuning parameters such as viewing angle, focal length, and lens-distortion correction, as well as coordinating with the installation environment, including ambient light and positioning; this precise adjustment of hardware and software is needed for each driver. Relying solely on behavioral criteria may therefore not always detect drowsiness accurately. Vehicle-based metrics use one or more sensors embedded inside the car to continuously monitor the driver’s head, hand, and eye movements [6,7]. Vehicle-based criteria comprise factors such as steering-wheel position, hand movements, speed, and vehicle acceleration. These criteria are noninvasive, but they depend heavily on the driver’s skills, road conditions, and vehicle characteristics [8]. Although noninvasive, they take a long time to detect a problem with the driver and cannot prevent an accident under real driving conditions [9]. Physiological criteria are an alternative and complement to vehicle-based and behavioral criteria. They include indicators such as heart rate, brain activity, and breathing rate, which can be obtained from sensors recording signals including the Electrocardiogram (ECG), Electroencephalogram (EEG), and Photoplethysmography (PPG) [10,11]. The level of drowsiness can also be analyzed through heart activity: Heart Rate Variability (HRV) measures the variance of intervals between heartbeats and is a valid indicator of cardiovascular activity across different physiological states [12]. One drawback of methods based on medical signals is the need for sensors and cables attached to the body for recording [13,14,15]. However, this problem can be effectively addressed by incorporating new wireless sensors, such as smart watches and wearables. The brain’s electrical activity exhibits distinct wave patterns in certain areas, and the differences in these patterns between alertness and drowsiness have been extensively researched; several techniques have been developed to process the EEG signal and identify drowsiness [16,17,18,19].
To provide context for our contribution, Table 1 presents a comparative overview of selected EEG-based drowsiness detection studies. The table outlines key methodologies, feature types, reported accuracy rates, and associated limitations. This comparison highlights the effectiveness of our ensemble-based model, which achieves competitive performance using exclusively handcrafted EEG features.
The level of vigilance is tightly linked to activity in specific brain regions [23]. We propose a lightweight, interpretable EEG-based method for drowsiness detection using data from the MIT-BIH Polysomnography database, which contains multi-channel recordings from clinical sleep studies. To promote computational simplicity and near–real-time feasibility, all experiments use a single EEG derivation (C4–A1)—a choice supported by prior evidence that single-channel EEG, when paired with robust processing, can capture drowsiness-related neural dynamics [7,11].
Signals were preprocessed and segmented into 30 s epochs, and 61 handcrafted features—covering linear, nonlinear, and frequency-based descriptors—were extracted. These features served as inputs to KNN, SVM, DT, and a bagging ensemble. The objective is an accurate, efficient, and explainable pipeline suitable for driver-monitoring contexts.
Consistent with the nonstationary and nonlinear nature of EEG during sleep, the selected features capture evolving temporal and complexity patterns associated with alertness–drowsiness states. (Section 2 reviews prior work; Section 3 details the dataset; Section 4 describes the methodology; Section 5 reports evaluations; Section 6 concludes with implications and future directions.)
Contributions:
A large-scale, balanced dataset of 6212 labeled 30 s EEG segments drawn from >80 h of MIT-BIH polysomnography.
Design and extraction of 61 handcrafted features (linear, nonlinear, frequency-based) chosen for robustness to signal noise/quality, offering broader coverage than narrowly statistical or purely deep-learning approaches.
A comparative analysis across multiple classifiers (KNN/SVM/DT) and a DT-based bagging ensemble, all trained and evaluated on the same data.
Bayesian hyperparameter optimization and performance reporting with six metrics—Accuracy, Precision, Sensitivity, Specificity, F1, and MCC—to support robust evaluation.
We posit that combining statistical, frequency, and model-based EEG features can reliably separate alert from drowsy states while preserving interpretability. Although EEG is sensitive to subtle neural changes, real-world deployment faces challenges (user comfort, electrode placement, motion artifacts). Accordingly, this work is an exploratory feasibility study under controlled conditions, identifying salient EEG features and evaluating interpretable ML models as a foundation for future simplified or hybrid solutions that pair physiological markers with practical deployment strategies. For deployment, single-channel designs are attractive because they reduce computational load and can be implemented with wearable or in-cabin sensors; such engineering integration is discussed as future work rather than a component of the present clinical analysis.
Because drowsiness exhibits consistent EEG signatures—elevated theta, reduced beta, transient alpha bursts, and lower nonlinear complexity—we frame the task as a physiology-centric EEG state-classification problem rather than a driving-performance study. Clinically annotated transitions enable precise labeling for an interpretable baseline model, and we outline steps for adaptation to simulator/on-road data where domain shift and artifact profiles differ.
2. Literature Review and Study Contributions
2.1. Related Work on EEG-Based Drowsiness Detection
Previous studies have explored the use of photoplethysmography (PPG) signals to examine the connection between heart rate variability and stress levels in individuals. A crucial parameter in this context is pulse transit time, which correlates closely with blood pressure changes [13,18,22]. Additionally, the pulsatile blood flow causes subtle color variations on the skin surface. To exploit this effect, certain methods deploy low-frame-rate cameras to capture facial movements, enabling the estimation of pulse transit time and subsequent reconstruction of the PPG signal [18,22,23,24]. These techniques have shown promising applicability in vehicular environments. PPG offers a noninvasive means to assess vascular properties such as arterial stiffness, elasticity, and microvascular blood volume fluctuations [21]. The cardiac cycle produces a pressure wave that propagates blood through tissues, causing volume changes detectable by illuminating the skin with a light-emitting diode (LED) and measuring the transmitted light via a photodiode sensor. Figure 1 illustrates the architecture of a typical PPG signal acquisition system [16].
Drowsiness can be inferred from heart rate variability (HRV), estimated from troughs of the PPG signal. A practical advantage of PPG is single-hand acquisition, which avoids requiring both hands on the steering wheel; however, such approaches often rely on relatively complex hardware (e.g., steering-wheel ECG/PPG sensors) and continuous skin contact, which may be impractical in real use [24]. Alternative lines of work reconstruct PPG from facial video to estimate HRV and driver state, but these methods are sensitive to camera placement, lighting, and per-user calibration, and purely behavioral/facial cues can be unreliable due to inter-subject variability and algorithmic constraints. In contrast, the present study focuses exclusively on EEG, offering a direct physiological assessment of vigilance that does not depend on visual inputs or vehicle-embedded sensors.
In the literature, two broad families of drowsiness evaluation have been reported. The first uses EEG, which underpins applications in gaming, psychotherapy, drowsiness assessment, and certain neurorehabilitation contexts [25]. Prior EEG work commonly employs frequency-domain features—e.g., PSD or wavelet-based descriptors—and trains ANN classifiers, with accuracies around 84.1% in some reports. For example, Chen et al. [20] and Delimaynati et al. [9] combined wavelet-band and Fourier-based spectral EEG features with EOG eyelid-movement cues and classified them using the Extreme Learning Machine (ELM), a fast single-hidden-layer approach that sets input weights randomly and solves output weights by least squares; they reported accuracies up to 95.6%. (ELM and the referenced multimodal features are part of prior work and are not used in the present study.)
A second stream integrates traditional signal processing with deep-learning feature extraction to improve accuracy. Reported features include energy distribution, zero-crossing velocity, spectral entropy, and instantaneous frequency structures [7,9,10]. In these studies, alpha-band activity is often isolated from EDF-formatted PhysioNet EEG, PSD powers in the delta/theta/alpha bands are estimated (typically via FFT), and classifiers such as ANN and SVM are evaluated with accuracy and ROC metrics. A representative block diagram of such pipelines appears in Figure 2.
In prior work, feature extraction commonly combines time-domain statistics (e.g., mean, variance, Hjorth parameters) with frequency-domain analyses (e.g., band power and power spectral density, PSD) and time–frequency methods (e.g., wavelet decomposition). Such multi-domain pipelines provide a comprehensive characterization of temporal and spectral patterns underlying transitions between alertness and drowsiness, followed by a classification stage.
Figure 3 illustrates representative EEG traces for alert versus drowsy conditions.
In related work, EEG preprocessing has often used a two-stage filtering pipeline: a bidirectional Butterworth filter followed by a low-pass stage (cutoff within 0.5–60 Hz), with adaptive filters to suppress biological artifacts (e.g., speech, eye movements) and power-line interference [22]. While some studies segmented EEG into 5 s epochs for PSD estimation under a quasi-stationarity assumption, that segmentation strategy was not adopted in the present study.
For feature selection and interpretability, prior work [22] applied Linear Discriminant Analysis (LDA) with a stepwise (forward/backward) procedure based on Lambda Prediction (LW) to rank feature importance. Using the most discriminative features, Artificial Neural Networks (ANNs) were trained: 21 three-layer architectures were explored with input sizes of 8/12/27 features, 10–40 hidden neurons, tansig output activation, and Levenberg–Marquardt training. Data were split 70%/30% for train/test, with validation after each training cycle; results (metrics and confusion matrices) are reported in Table 2 and Table 3 of that study [22].
Note: Techniques such as PSD segmentation, LDA-based feature selection, and ANN classification are discussed solely as part of prior literature and are not incorporated into the methodology of the present study.
2.2. Contribution of This Study
Building upon the strengths and addressing the limitations of previous EEG-based drowsiness detection research, this study introduces a lightweight and interpretable framework utilizing 61 handcrafted features extracted from time-domain statistics, frequency-domain energy distributions, and model-based parameters to classify states of drowsiness and alertness. Unlike deep learning models that demand extensive datasets and high computational resources, our approach emphasizes simplicity and transparency by employing machine learning classifiers such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and ensemble bagging methods. Diverging from prior techniques that depend on multiple EEG channels or auxiliary modalities, we demonstrate that effective classification can be achieved using a single EEG electrode, facilitating real-time implementation in embedded or wearable devices. Our comparative analysis further substantiates that ensemble learning markedly enhances classification accuracy while maintaining computational efficiency.
While previous studies applied Linear Discriminant Analysis (LDA) for feature selection, our approach intentionally avoids LDA and other dimensionality-reduction methods to retain complete interpretability of the extracted features.
3. Data Description
This study uses the MIT-BIH Polysomnography Database, comprising multi-physiological overnight recordings acquired at the Beth Israel Hospital Sleep Laboratory (Boston) for the monitoring of obstructive sleep apnea and evaluation of Continuous Positive Airway Pressure (CPAP) therapy. For analysis, EEG was segmented into 30 s epochs and mapped to two classes: Alertness, defined as expert-labeled Wake (W) with predominant beta activity; and Drowsiness, defined as stages N1–N3, with N1 reflecting microsleep-like episodes relevant to driving contexts. REM epochs were excluded because their physiology is not directly aligned with the wake-to-sleep transition of interest. Using this binary scheme, we assembled a balanced set of 6212 EEG epochs.
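As an illustration of this binary labeling scheme, the mapping from sleep-stage annotations to the two classes can be sketched as follows. This is a minimal Python sketch for exposition only; the stage codes and epoch representation are simplified assumptions, not the study's actual annotation-parsing code.

```python
# Illustrative mapping of 30 s polysomnography epochs to binary labels.
# Stage codes ("W", "N1".."N3", "REM") and the (signal, stage) epoch tuples
# are assumptions for this sketch; the study used MIT-BIH annotations.

def map_epoch_label(stage: str):
    """Return 'alert', 'drowsy', or None if the epoch is excluded."""
    if stage == "W":
        return "alert"           # Wake -> alertness class
    if stage in {"N1", "N2", "N3"}:
        return "drowsy"          # N1-N3 -> drowsiness class
    return None                  # REM (and anything else) is excluded

def build_dataset(epochs):
    """Keep only epochs whose stage maps to one of the two classes."""
    labeled = []
    for signal, stage in epochs:
        label = map_epoch_label(stage)
        if label is not None:
            labeled.append((signal, label))
    return labeled
```

Excluding REM at this stage keeps the task focused on the wake-to-sleep transition described above.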
Overall, the database provides >80 h of polysomnography across four-, six-, and seven-channel montages. Each record includes beat-by-beat labeled ECG and EEG/respiratory channels annotated for sleep staging and apnea events [26,27]. The dataset has been widely employed in research on sleep staging and vigilance, and its clinically curated annotations underpin numerous EEG studies [26,27].
Scope note: EEG was obtained in a clinical polysomnography paradigm to support EEG-based state classification (wakefulness vs. drowsiness), rather than assessment of driving behavior; implications for deployment in vehicles are addressed in the Discussion. Recordings were sampled at 256 Hz with 16-bit resolution. Although multi-channel signals are available, the present analysis focuses on the C4–A1 derivation.
4. Methodology
4.1. EEG Acquisition
Scalp EEG from the MIT-BIH Polysomnography Database was acquired in a clinical sleep laboratory using standard PSG instrumentation and expert scoring. The analysis focused on the C4–A1 derivation, a central lead commonly used for drowsiness studies due to reduced ocular contamination and sensitivity to sleep-onset dynamics. Signals were recorded with clinical grounding and digitized at the database’s native specifications (256 Hz, 16-bit). Sleep stages were scored by clinical experts according to standard guidelines, and labels were mapped to Wake vs. drowsiness as defined in the labeling section. To attenuate acquisition-related artifacts, EEG was band-pass filtered 0.5–30 Hz (relevant alertness/drowsiness bands); no notch filter was applied given minimal power-line contamination in this dataset and adequate suppression by the band-pass.
4.2. Signal Preprocessing
EEG preprocessing used the MIT-BIH Polysomnography data segmented into 30 s, expert-labeled epochs. For computational simplicity and real-time suitability, analysis was restricted to the C4–A1 channel. The pipeline comprised:
Band-pass filtering (0.5–30 Hz, 4th-order, zero-phase Butterworth) to retain components relevant to alertness/drowsiness while suppressing slow drift and high-frequency noise; no notch filter was applied given minimal power-line contamination and adequate attenuation by the band-pass.
Segmentation and labeling: 30 s epochs kept their clinical labels; Wake (W) epochs were treated as alert, and N1–N3 as drowsy; REM was excluded.
All steps follow established EEG practices and were implemented with custom MATLAB R2024a scripts. The focus on C4–A1 reflects its sensitivity to sleep-onset dynamics with reduced ocular contamination, consistent with prior drowsiness literature [7,11].
To reduce baseline drift, signals were mean-centered and de-trended (least-squares line removal) within each labeled epoch following band-pass filtering. Processing parameters were derived from the training portion only during evaluation to prevent information leakage. This filtering configuration aligns with common EEG protocols in drowsiness and sleep studies [7,20].
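The filtering and de-trending steps above can be sketched in code. The study used MATLAB; the Python sketch below is an illustrative equivalent whose function name and call sequence are assumptions, while the filter specification (4th-order zero-phase Butterworth, 0.5–30 Hz pass-band, 256 Hz sampling) follows the text.

```python
import numpy as np
from scipy.signal import butter, filtfilt, detrend

FS = 256  # sampling rate of the MIT-BIH recordings (Hz)

def preprocess_epoch(x, fs=FS, band=(0.5, 30.0), order=4):
    """Zero-phase band-pass filter, then mean-center and de-trend one epoch."""
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="bandpass")
    y = filtfilt(b, a, x)           # forward-backward filtering -> zero phase lag
    y = y - np.mean(y)              # mean-centering
    y = detrend(y, type="linear")   # least-squares line removal
    return y
```

Zero-phase (forward–backward) filtering preserves the timing of EEG events within each epoch, which matters when features are later compared across labeled segments.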
4.3. Feature Extraction
From each 30 s EEG epoch (C4–A1), we extracted 61 handcrafted features capturing temporal statistics, nonlinear dynamics, and model-based descriptors. The feature set comprised:
Time-domain statistics: mean, variance, skewness, kurtosis, Hjorth parameters, zero-crossing rate, and related waveform-shape indices.
Nonlinear/complexity metrics: Shannon entropy, fractal dimension, Hurst exponent, and Detrended Fluctuation Analysis (DFA).
Model-based descriptors: low-order autoregressive (AR) coefficients, signal-energy measures, and dominant-frequency estimates.
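Two of the listed time-domain descriptors, the Hjorth parameters and the zero-crossing rate, can be computed as in the following Python sketch. This is illustrative only; the study's MATLAB implementation may differ in normalization details.

```python
import numpy as np

def hjorth_parameters(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal."""
    dx = np.diff(x)                    # first derivative (discrete)
    ddx = np.diff(dx)                  # second derivative
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / np.var(x))
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.sign(x)
    signs[signs == 0] = 1              # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))
```

Mobility tracks the dominant frequency content and complexity the waveform's deviation from a pure sinusoid, both of which shift as the EEG slows toward drowsiness.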
Clarification on spectral methods and non-stationarity. No Power Spectral Density (PSD) or other Fourier-based descriptors were computed in this study; any PSD discussion appears only in the literature review to contextualize prior work. The present feature set is limited to time-domain statistics, nonlinear/complexity metrics, and low-order AR descriptors, which are less sensitive to EEG non-stationarity.
AR-order selection. The AR model order was fixed at 10 based on (i) empirical sweeps (orders 4–14) showing no consistent performance gain beyond 10 but increasing parameter variance, and (ii) established EEG practice for 256 Hz data below 30 Hz, where 8–12 coefficients provide a balanced trade-off between spectral fidelity and stability.
Given the nonstationary nature of EEG signals, the 30 s EEG epochs were divided into 500 ms rectangular windows with a 400 ms overlap to ensure local stationarity. Within each short window, the signal can be reasonably considered quasi-stationary, allowing reliable estimation of autoregressive (AR) parameters.
A 10th-order AR model was fitted to each window using the Yule–Walker method implemented in MATLAB’s Signal Processing Toolbox. This window length (500 ms) provided an optimal balance between temporal resolution and the statistical reliability of parameter estimation, given the 256 Hz sampling rate. Such a setting ensures sufficient data points per window for stable parameter estimation while maintaining sensitivity to short-term EEG dynamics.
For each window, ten AR coefficients (a1–a10) were obtained. To summarize these parameters at the segment level, coefficients corresponding to the same order (e.g., a1 across all windows) were averaged across all windows, resulting in ten representative AR features per 30 s EEG segment. This averaging approach preserves the overall temporal structure of the signal while minimizing the influence of transient fluctuations.
All coefficients were z-score normalized prior to aggregation to ensure consistency across different windows and subjects. This procedure ensures that AR modeling was performed under locally stationary conditions and that the resulting features robustly capture the underlying temporal dynamics of the EEG signal. The adopted framework aligns with established practices in EEG autoregressive and spectral modeling literature, which recommend sub-second windowing for reliable AR estimation on nonstationary EEG data.
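The windowed AR procedure described above (500 ms windows, 400 ms overlap, 10th-order Yule–Walker fits, per-order averaging) can be sketched in Python as follows. The exact sample counts, the biased autocorrelation estimator, and the sign convention are illustrative assumptions; the study used MATLAB's Yule–Walker implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def ar_coeffs_yule_walker(x, order=10):
    """Estimate AR prediction coefficients of one window (Yule-Walker)."""
    x = x - np.mean(x)
    n = len(x)
    # biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # solve the Toeplitz system R a = r[1:], prediction form x[n] ~ sum a_k x[n-k]
    return solve_toeplitz(r[:order], r[1 : order + 1])

def segment_ar_features(x, fs=256, order=10, win_s=0.5, overlap_s=0.4):
    """Average per-window AR coefficients over a 30 s segment."""
    win = int(win_s * fs)                   # 500 ms -> 128 samples at 256 Hz
    hop = int((win_s - overlap_s) * fs)     # 100 ms hop (400 ms overlap)
    coeffs = [ar_coeffs_yule_walker(x[s : s + win], order)
              for s in range(0, len(x) - win + 1, hop)]
    return np.mean(coeffs, axis=0)          # ten representative AR features
```

Averaging same-order coefficients across windows, as described above, damps transient fluctuations while preserving the segment's overall temporal structure.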
All features were computed from the preprocessed signals described in Section 4.1. Windowing was applied only for the AR-based (model-derived) descriptors as detailed above, where 500 ms overlapping windows were used to ensure local stationarity. In contrast, all other features—time-domain, statistical, and nonlinear/complexity measures—were extracted directly from the full 30 s EEG epochs without further segmentation. To improve numerical stability, both min–max and z-score normalization were applied, and all features were evaluated individually and jointly as classifier inputs. No feature selection or dimensionality-reduction step was applied; all 61 features were retained to preserve interpretability and traceability to the underlying EEG mechanisms (Table 4).
Each 30 s EEG segment was processed using a custom MATLAB R2024a implementation of standard signal-processing routines. The extracted features were organized into three principal categories: (1) time-domain statistics (e.g., mean, RMS, variance/kurtosis/skewness, SNR); (2) model-based time-series descriptors derived from a 10th-order AR model, including dominant frequency, damping ratio, and residual-error statistics; and (3) nonlinear/complexity measures such as entropy-based indices and DFA, as detailed in Table 4.
Signals were detrended after band-pass filtering, and features were computed from the resulting preprocessed C4–A1 epochs. The extracted feature set was subsequently used to train and evaluate well-established classifiers—KNN, SVM, DT, and a DT-based bagging ensemble (EL)—commonly adopted in EEG classification research [28,29]. Performance was assessed using Accuracy, Precision, Sensitivity, Specificity, F1-Score, and Matthews Correlation Coefficient (MCC).
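For reference, the six reported metrics follow directly from binary confusion-matrix counts, as in this small sketch (the function name is illustrative):

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute the six reported metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall / true-positive rate
    specificity = tn / (tn + fp)      # true-negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, precision=precision,
                sensitivity=sensitivity, specificity=specificity,
                f1=f1, mcc=mcc)
```

Unlike accuracy alone, MCC stays informative under class imbalance, which is why it complements the other five metrics here even though the dataset is balanced.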
4.4. Classification Algorithms
ML algorithms are extensively employed for classification in medical diagnosis. This research assesses the efficacy of SVM, KNN, DT, and DT-based bagging EL algorithms in the classification of drowsiness and alertness. The selection of these algorithms was based on their unique operational features and prevalence in research. The following is a comprehensive overview of the functioning of each algorithm.
4.4.1. Support Vector Machine (SVM)
The SVM is a powerful tool for classifying data effectively [24]. Its operating principle is to find a linear separator between classes that maximizes the margin, i.e., the distance between the decision boundary and the nearest samples of each class. This technique works directly for two-class data; multiclass problems are typically handled by decomposing them into pairwise (one-vs-one) or one-vs-rest binary problems. Figure 4 shows the schematic of the SVM algorithm.
4.4.2. K-Nearest-Neighbor (KNN)
KNN is a statistical technique used for classification and regression. K refers to the number of closest training samples considered in the feature space [30]. An unlabeled test sample is assigned the class most common among its K nearest neighbors in the training set. Various methods can be used to compute neighborhood distance or to weight the contributions of different neighbors.
4.4.3. Decision Tree (DT)
A DT is a hierarchical model that plays a crucial role in decision-making processes [31]. It considers chance events, resource costs, and utility, and is presented as a tree structure with nodes that represent decisions or conditions. The algorithm uses criteria such as entropy or Gini impurity to partition data into categories. Due to their high interpretability and readability, DTs are essential in data mining, artificial-intelligence decision-making, and ML.
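The three base classifiers just described can be compared on a common split using standard library implementations. The sketch below uses scikit-learn on synthetic stand-in data; all parameter values and the two-cluster data are illustrative assumptions, not the study's tuned settings or actual EEG features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 61-feature EEG matrix (values are illustrative).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (300, 61)),    # class 0: "alert"
               rng.normal(2.0, 1.0, (300, 61))])   # class 1: "drowsy"
y = np.array([0] * 300 + [1] * 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", C=1.0),
    "DT": DecisionTreeClassifier(max_depth=5, random_state=0),
}
# Fit each model on the same training set and score on the same test set.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
```

Evaluating all models on an identical stratified split, as in the study, makes their test accuracies directly comparable.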
4.4.4. Bagging Ensemble Learning
Bagging is an effective EL technique that aims to minimize learning errors by utilizing a group of homogeneous ML models [32,33]. Its objective is to reduce variance, which in turn lowers classification or regression error [34,35]. The process involves selecting the number and type of base models, then drawing training data for each model via the bootstrap approach (random sampling with replacement), so that the training set is divided into several resampled subsets and each base model is trained on a different one. Although the base models are of the same type, training on different subsets gives each model different knowledge of the data. During testing, each trained model produces an output estimate for new data, and the estimates are combined: in a classification problem, a simple majority vote determines the class, with the class receiving the most votes declared the winner; in a regression problem, the individual predictions are simply averaged.
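The bootstrap-aggregation procedure described above can be illustrated with a DT-based bagging ensemble in scikit-learn. The data and hyperparameters below are synthetic placeholders, not the study's tuned configuration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class stand-in data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (300, 61)),
               rng.normal(2.0, 1.0, (300, 61))])
y = np.array([0] * 300 + [1] * 300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Bootstrap-aggregated decision trees: each tree is trained on a resampled
# (with replacement) copy of the training set, and the ensemble classifies
# new samples by majority vote over the trees.
bag = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    bootstrap=True,          # sample training data with replacement
    random_state=0,
)
bag.fit(X_tr, y_tr)
```

Because each tree sees a different bootstrap sample, their errors are partially decorrelated, which is the variance-reduction mechanism the text describes.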
During preliminary experiments, we assessed several feature-selection and dimensionality-reduction strategies; however, none yielded consistent improvements over the complete 61-feature set, so their details are omitted here for brevity and all features were retained. In the final workflow, no dimensionality-reduction or algorithmic feature-selection step was applied; keeping all 61 handcrafted features preserves interpretability and full physiological coverage. Potential overfitting was addressed via model-specific hyperparameter constraints (e.g., SVM margin/kernel settings, KNN neighborhood size, and tree depth/min-samples limits in DT and the DT-based ensemble) and by reporting performance on a held-out test set averaged across repeated splits.
5. Results and Discussions
This study distinguishes alert versus drowsy states using EEG from the MIT-BIH polysomnography dataset. Signals were segmented into 30 s epochs and labeled as Wake (alert) or N1–N3 (drowsy) as outlined in Section 3; REM was excluded. In total, a balanced set of 6212 segments was analyzed. From each segment, 61 features were extracted (time-domain linear/nonlinear measures plus a small set of model-based, AR-derived frequency descriptors).
Data were partitioned with stratified sampling (70% train/30% test) to preserve class balance. To reduce variability from a single split, the procedure was repeated across five independent runs with different seeds, and average performance is reported. The split was performed at the signal level (i.e., subject-dependent), so segments from the same individual could appear in both train and test sets. We trained KNN, SVM, and DT models with hyperparameters tuned via Bayesian optimization. KNN achieved the highest training accuracy (99%) but generalized less effectively (80.4% test). SVM showed more balanced train–test behavior. A DT-based bagging ensemble (EL) yielded the best overall test performance—accuracy 84.7% and F1 84.9%—surpassing single classifiers on accuracy, sensitivity, and F1.
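The evaluation protocol just described, a stratified 70/30 hold-out repeated over several seeds with averaged scores, can be sketched as follows. This is a Python/scikit-learn illustration; the model factory and synthetic data in the test are placeholders for the study's classifiers and EEG features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def repeated_holdout_accuracy(X, y, model_factory, n_runs=5, test_size=0.3):
    """Average test accuracy over repeated stratified 70/30 splits.

    model_factory: zero-argument callable returning a fresh, unfitted model,
    so each run trains from scratch on its own split.
    """
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = model_factory()
        scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting the mean (and spread) across seeds, rather than a single split, reduces the variability noted in the text; note that splitting at the segment level remains subject-dependent.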
The ensemble’s perfect training accuracy is not a practical advantage; it indicates overfitting driven by dataset size and feature dimensionality. Accordingly, test accuracy (84.7%) is the appropriate indicator of performance, and validation on independent datasets is required to assess generalizability. The ensemble’s strength stems from bagging, which reduces variance and enhances model diversity—useful under noisy, high-dimensional EEG conditions—yet the notable train–test gap (≈100% vs. 80.4–84.7%) confirms some overfitting.
To mitigate overfitting, we (i) compared multiple classifiers (KNN/SVM/DT and the DT-based ensemble), (ii) conducted Bayesian hyperparameter tuning, and (iii) examined normalization configurations and alternative random splits. None surpassed the ensemble on the held-out test sets. The observed gap is attributable to (a) limited inter-class diversity, (b) 61-dimensional feature space, and (c) noise/ambiguity during transitions between alert and drowsy states.
Recommendations: Future work should (1) expand to larger, more diverse EEG cohorts, (2) introduce regularization within the ensemble framework, and (3) consider dimensionality management strategies where appropriate. We also assessed data standardization as a potential remedy, but it did not improve performance. Consequently, the DT-based bagging ensemble was adopted, delivering 100% training accuracy and 84.7% test accuracy. On test data, the ensemble achieved higher Precision, Sensitivity, F1, and MCC, while KNN obtained the highest Specificity. Summary metrics are given in Table 5, with confusion matrices in Table 6.
We acknowledge that a subject-dependent split can yield optimistic estimates because individual-specific EEG patterns may appear in both training and test sets. While this design is common in early-stage feasibility studies, subject-independent validation will be required for practical clinical/automotive deployment to ensure generalizability.
Development-time trials. During preliminary experiments we applied 10-fold cross-validation, class rebalancing (under/oversampling), and algorithmic feature selection/dimensionality reduction. None produced consistent or meaningful gains across models, so—for focus and brevity—their detailed results are omitted. Final results are therefore reported under a stratified 70/30 hold-out protocol repeated five times, with the full 61-feature set retained to preserve interpretability.
The pronounced train–test gap, most evident for KNN and DT, indicates potential overfitting that was not substantially reduced by the explored configurations (feature selection, k-fold schemes, or hyperparameter tuning). Among all methods, the ensemble learning (EL) approach showed greater stability, principally due to bootstrap aggregation reducing variance—highlighting the effectiveness of ensembles for noisy, high-dimensional EEG. Future work should investigate stronger regularization, simplified model architectures, and individualized learning to enhance generalization. A limited re-referencing check (C4 to linked mastoids and to a common average) showed no material difference relative to C4–A1; accordingly, we report results for C4–A1.
Ecological Validity and Domain Shift to Driving
Our analysis focuses on EEG-based recognition of wakefulness–drowsiness transitions using clinically annotated labels. The task-agnostic neural signatures employed—elevated theta, reduced beta, transient alpha bursts, and lower entropy/complexity—have been reported across resting, vigilance, and simulator paradigms, supporting the use of clinical EEG to establish an interpretable baseline for drowsiness detection.
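Two of the spectral markers named above (elevated theta relative to beta, and spectral entropy) can be computed from a short EEG segment with Welch's method. The sketch below uses a synthetic signal with a dominant 6 Hz (theta) component; the sampling rate, band edges, and signal are illustrative assumptions.

```python
# Sketch of theta/beta band power and spectral entropy via Welch's method.
# The signal is synthetic: a drowsy-like trace with theta dominant over beta.
import numpy as np
from scipy.signal import welch

fs = 256  # assumed sampling rate in Hz
rng = np.random.default_rng(0)
t = np.arange(0, 4, 1 / fs)  # 4 s segment
x = (2.0 * np.sin(2 * np.pi * 6 * t)     # theta component (6 Hz)
     + 0.5 * np.sin(2 * np.pi * 20 * t)  # weaker beta component (20 Hz)
     + 0.3 * rng.standard_normal(t.size))

f, psd = welch(x, fs=fs, nperseg=fs)  # 1 Hz frequency resolution

def band_power(f, psd, lo, hi):
    """Integrate the PSD over [lo, hi) Hz (rectangle rule)."""
    m = (f >= lo) & (f < hi)
    return psd[m].sum() * (f[1] - f[0])

theta = band_power(f, psd, 4, 8)
beta = band_power(f, psd, 13, 30)

# Spectral entropy: Shannon entropy of the normalized PSD.
p = psd / psd.sum()
spec_entropy = -np.sum(p * np.log2(p + 1e-12))

print("theta/beta ratio:", theta / beta)  # > 1 for this theta-dominant trace
print("spectral entropy:", spec_entropy)
```

A rising theta/beta ratio and a falling spectral entropy over successive epochs are the kind of trajectories the handcrafted feature set is designed to expose to the classifier.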
Nonetheless, clinical sleep studies and simulator/on-road driving differ in artifact profiles (EOG/EMG/motion), sensory load, and vigilance dynamics. To enable translation, we recommend: (i) channel-matched acquisition with robust artifact mitigation; (ii) subject-independent evaluation on driving-drowsiness datasets (simulator or on-road); and (iii) lightweight domain adaptation (e.g., feature re-centering or covariance alignment) without modifying the core classifier. Such external validation is a necessary next step.
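The lightweight adaptation in recommendation (iii) can be sketched as feature re-centering plus CORAL-style covariance alignment: whiten the target-domain features and re-color them with the source-domain covariance, leaving the trained classifier untouched. The domain sizes, feature dimension, and regularization constant below are illustrative assumptions.

```python
# Sketch of feature re-centering + CORAL-style covariance alignment.
# Only the feature matrices are transformed; the classifier is unchanged.
import numpy as np

def coral(X_src, X_tgt, eps=1e-6):
    """Whiten target features, re-color with the source covariance,
    and shift to the source mean."""
    d = X_src.shape[1]
    C_t = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
    C_s = np.cov(X_src, rowvar=False) + eps * np.eye(d)

    def mat_pow(C, p):
        # Symmetric matrix power via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.clip(w, eps, None) ** p) @ V.T

    X_white = (X_tgt - X_tgt.mean(axis=0)) @ mat_pow(C_t, -0.5)
    return X_white @ mat_pow(C_s, 0.5) + X_src.mean(axis=0)

rng = np.random.default_rng(0)
X_clinic = rng.normal(0.0, 1.0, size=(200, 61))  # clinical-domain features
X_drive = rng.normal(0.5, 2.0, size=(200, 61))   # shifted "driving" domain
X_adapted = coral(X_clinic, X_drive)

# Target statistics now match the source: safe to feed to the fixed model.
print(np.allclose(X_adapted.mean(axis=0), X_clinic.mean(axis=0), atol=1e-6))
```

Because the transform is a fixed affine map estimated from unlabeled target data, it can run at deployment time without retraining the ensemble.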
6. Conclusions and Future Work
This study leveraged the MIT-BIH Polysomnography dataset (18 records; >80 h of clinically annotated overnight data) to distinguish alertness vs. drowsiness from single-channel EEG (C4–A1). We extracted 61 handcrafted features spanning time-domain, nonlinear, and frequency descriptors, and evaluated KNN, DT, SVM, and a DT-based ensemble (EL). Consistent with the nonstationary and nonlinear nature of EEG during sleep, the feature set captured relevant dynamics and enabled reliable discrimination: the EL achieved 84.7% test accuracy with F1 = 84.9%, outperforming single classifiers.
The findings confirm that a lightweight, interpretable pipeline can detect vigilance states from clinical EEG with low computational overhead, making it suitable for embedded/edge inference. At the same time, the train–test gap (perfect training accuracy vs. lower test accuracy) indicates overfitting, emphasizing the need for larger, more heterogeneous cohorts to establish generalizability.
Using a clinical PSG database is appropriate for a methodological baseline because (i) drowsiness-related EEG markers (elevated theta/alpha and reduced beta) are physiologically consistent across lab and driving settings, (ii) high-quality annotations provide precise labels, and (iii) such datasets are well established in prior work (e.g., Chen et al. [20]; Christensen et al. [22]). Our aim here was preclinical method development, not direct in-vehicle deployment.
Implications and future work: For real-world translation, priorities include: (1) subject-independent evaluation on simulator/on-road driving datasets; (2) multimodal integration with EOG, EMG, HRV, and respiration to boost robustness; (3) development of artifact-resistant, minimally intrusive sensors (e.g., dry-electrode headbands/ear-EEG) to address motion and comfort; (4) regularization and simplified model architectures, plus personalization to individual baselines; and (5) lightweight domain adaptation (e.g., feature re-centering/covariance alignment) without altering the core classifier. The proposed EEG module should be viewed as complementary to existing camera-based fatigue monitoring to increase resilience to environmental variability.
Recent deep learning systems (e.g., HATNet, IEEE TCYB 2025; ~90–92% test accuracy) illustrate the upper bound in accuracy but often require substantial compute and GPU-supported inference. In contrast, our ensemble of decision trees with engineered features attained 84.7% while providing full interpretability and fast, resource-efficient inference, a practical advantage for embedded automotive platforms.
Finally, single-channel modeling was intentional to support deployment and interpretability. Although multi-channel montages may offer incremental gains (e.g., spatial filtering with CAR/Laplacian), our sensitivity checks suggest the present conclusions do not hinge on a particular reference; systematic multi-channel validation is left for future work.
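Of the spatial filters mentioned, common average referencing (CAR) is the simplest: each channel minus the instantaneous mean across all channels. The sketch below shows it on a synthetic multi-channel array; channel count and data are illustrative.

```python
# Sketch of common average referencing (CAR) for a multi-channel montage.
import numpy as np

rng = np.random.default_rng(0)
eeg = rng.standard_normal((8, 1024))  # 8 channels x 1024 samples (synthetic)

# Subtract the per-sample average of all channels from every channel.
car = eeg - eeg.mean(axis=0, keepdims=True)

# After CAR, the cross-channel average is numerically zero at every sample.
print(np.abs(car.mean(axis=0)).max())
```

Because CAR needs the full montage at every sample, it is a natural first step for the multi-channel validation deferred to future work, while the single-channel C4–A1 pipeline remains untouched.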