1. Introduction
Accurate monitoring and classification of sleep stages are critically important because sleep plays a fundamental role in regulating numerous physiological and cognitive processes, as highlighted by [1]. Sleep activity has been closely linked to immune function, metabolic regulation, memory consolidation, and emotional processing. Therefore, disrupted sleep patterns or misclassifications of sleep stages can hinder the timely identification of sleep-related disorders such as insomnia, sleep apnea, or narcolepsy. As such, the development of reliable computational models for sleep stage classification has significant implications not only for clinical diagnosis and personalized medicine, but also for broader efforts in preventive health and well-being monitoring.
In particular, adults who sleep less than seven hours a night are at greater risk of weight gain, diabetes, high blood pressure, heart disease, stroke, and depression. However, the amount of sleep is certainly not the only important aspect to monitor. According to [2], the quality of sleep is also essential. Monitoring sleep stages during the night enables an objective assessment of a person’s sleep quality. While sleeping, we spend different amounts of time in five different sleep stages (American Academy of Sleep Medicine, AASM, guidelines), as extensively described by [3]:
Wake (W) accounts for 5–15% of a healthy adult’s night rest;
Rapid eye movement (REM) accounts for 20–25%;
Non-REM light sleep (NREM1) accounts for 2–5%;
Non-REM medium sleep (NREM2) accounts for 45–55%;
Non-REM deep sleep (NREM3) accounts for 10–20%.
Therefore, the physiology of sleep makes sleep stage classification (SSC) an inherently imbalanced use case for data scientists aiming to create SSC models.
The gold standard for monitoring sleep stages and the associated sleep quality is polysomnography (PSG), which consists of electroencephalography (EEG), electromyography (EMG), electrocardiography (ECG), and electrooculography (EOG) [4]. The raw PSG data are then manually annotated by professional technicians, who estimate the sleep stage of monitored patients within 30-s epochs. The commonly accepted silver standard is actigraphy (ACG), but its poor reliability and instability have been demonstrated over time by [5]. The performance of ACG was shown to be comparable to or poorer than that of other wearable and sport-tracking technologies, making ACG unattractive for sleep stage classification (SSC). PSG is therefore the best and most reliable solution, but it is associated with high resource consumption, high costs, patient discomfort, and limited interrater agreement. Moreover, it is usually performed in specialized laboratories and is thus not viable for longitudinal monitoring [6].
Even though PSG is extremely reliable, it is expensive, invasive, complex, and not suitable for longitudinal studies. In this regard, the present work aims to answer the following research questions (RQ):
- (RQ1):
Is it possible to develop reliable and efficient modeling solutions that directly perform SSC through non-invasive sensors measuring heart rate (HR) and motion?
- (RQ2):
Do imbalance management techniques offer a way to improve the results achievable in non-invasive SSC through HR and motion?
The remainder of this work is organized as follows: Section 2 reviews the literature on the two main topics of this work, non-invasive SSC and imbalance management techniques for imbalanced classification data. Section 3 is dedicated to a detailed presentation of the experiments performed and the dataset used. Section 4 presents and discusses the achieved results. The last section summarizes the obtained results with respect to the RQs and compares our best model’s performance with that of similar works in the current literature; additionally, it summarizes the findings and intuitions developed on the basis of the extensive number of experiments performed.
2. Related Works
Some researchers have attempted to automate the annotation task by developing solutions capable of extracting features from raw ECG signals and then using these predictors for the classification of sleep stages. In [7], heart rate (HR), respiratory rate (RR), and motion were shown to be suitable for the SSC task. Two training scenarios (subject-specific and subject-independent) were explored, achieving 51% accuracy in discerning all five sleep stages (5-SSC) and 77% accuracy in 3-SSC (the NREM1-2-3 phases condensed into a single NREM class). Nonetheless, these HR, RR, and motion signals were extracted from raw PSG signals, thus being potentially different from those that other, less intrusive monitoring technologies could provide. Similarly, ref. [8] used HR, RR, and movement (number of movements in a 30-s window) for SSC; these signals were likewise extracted and computed from the original PSG signals. Ref. [9] used HR data only (directly derived from ECGs) for the SSC task, reaching 66% accuracy for 5-SSC and 72% for 4-SSC (with NREM1 and NREM2 aggregated into a single state). Ref. [10] presented an open-source Python 3.12 package for sleep staging based on heart rate variability extracted from raw ECGs. Recent studies have also explored machine learning approaches for automated sleep analysis using physiological signals, highlighting the growing potential of data-driven techniques for unobtrusive sleep monitoring [11]. In [4], a generative adversarial network (GAN) was proposed for the management of class imbalance; the input of the network is the raw EEG signal from the acquired PSG. Nonetheless, the authors were not interested in non-invasive SSC.
2.1. Modern SSC Solutions
PSG weaknesses, together with advancements in sensing technologies and in data analysis and modelling solutions, fostered the adoption of wearable devices [12] and noncontact sensors to build models capable of accurately estimating sleep stages without the need for invasive and expensive PSG measurements [1]. Ref. [13] employed HR and motion-count signals extracted from a wearable device, comparing several DL and ML models and some imbalance management techniques on an open dataset involving 31 subjects. They performed binary classification only (sleep-wake recognition), achieving approximately 91–95% accuracy, 63–67% specificity, and 94–98% sensitivity. Ref. [14] solved the 4-SSC task via a Fitbit device (a wearable consumer tracker). Two sequential models were developed: the first cleans the sleep stages computed by the proprietary Fitbit algorithm; when a misclassified sleep stage is detected, the second model computes the correct sleep stage from the raw data collected by the wearable device. The authors used random upsampling (RUS) and random downsampling (RDS); the effectiveness of under/oversampling has been investigated in another dedicated work [15]. Ref. [6] used an EarlySense contactless device, focusing on 4-SSC and sleep-wake (SW) recognition. Ref. [5] included in their study 7 consumer devices for sleep stage monitoring (4 wearables, 1 EarlySense, and 2 based on radio-frequency waves); a comparison with PSG was performed for both epoch-by-epoch (EBE) 4-SSC and the aggregated sleep summary measures.
For the effective development of non-invasive SSC solutions based on ML and DL, the availability of open data is essential, and the recent review of [1] lists the datasets available up to 2022. Recently, an open dataset for sleep stage classification was published by [16] on the PhysioNet platform developed by [17]. It comprises 100 nights monitored via both PSG and a wearable device (the Empatica E4 bracelet), thus being a suitable reference for building and comparing data-driven modeling and data imbalance management solutions for sleep stage classification via simpler and less invasive monitoring technology. This open dataset (together with other open datasets) was used by [18], who focused on SW recognition through acceleration signals and self-supervised modeling.
In summary, much attention has been given to sleep monitoring over time; however, few studies have focused on making it a non-invasive and easier task through the development of algorithms for wearable or noncontact device signals. Studies assessing the reliability of contactless technologies (such as under-mattress belts) are usually pilot studies with few patients involved (approximately 5 people). Among previous studies addressing our same objective, namely, developing algorithms based on easily collectable physiological signals for accurate sleep stage classification and using this framework to evaluate data imbalance management techniques, only a few have considered both multi-class sleep stage recognition beyond the simple sleep–wake (SW) task and the impact of imbalance mitigation strategies [13].
2.2. Class Imbalance Management Techniques
Sleep stage classification, akin to numerous other tasks in biomedical data analysis, is notably characterized by pronounced class imbalance. This intrinsic characteristic of data necessitates the rigorous assessment of mitigation strategies aimed at addressing data complexity and enhancing the robustness and generalizability of data-driven models in highly imbalanced and heterogeneous settings such as the one selected.
The class imbalance problem is evident both in simple sleep/wake recognition and in more fine-grained sleep stage classification tasks focused on REM, NREM1 (also called light sleep), NREM2, NREM3 (also called deep sleep), and wake-state recognition. Specifically, a normal human night’s sleep typically consists of 40–60% of the time spent in light sleep (NREM 1 and 2), 15–25% in REM sleep, 15–20% in deep sleep (NREM 3), and 5–15% in wakefulness [19].
There are several domains of ML applications characterized by a strong class imbalance, especially if we focus on biomedical application scenarios. Several strategies have been proposed and adopted over time.
We can summarize the main typologies as follows:
Data-level techniques—data resampling (over- and undersampling) such as the renowned synthetic minority oversampling technique (SMOTE);
Algorithm-level techniques—cost-sensitive learning (CSL), and imbalance robust models, such as ensembles;
Other approaches—hybrid combinations.
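As an illustration of the first two categories, the following sketch (with purely illustrative data and class ratios, not taken from this study) contrasts data-level random oversampling with algorithm-level cost-sensitive learning implemented via scikit-learn class weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy imbalanced binary problem: 90 majority vs. 10 minority samples.
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Data-level technique: random oversampling of the minority class to parity
# (SMOTE/ADASYN would synthesize new points instead of duplicating existing ones).
extra = rng.choice(np.where(y == 1)[0], size=80, replace=True)
X_bal, y_bal = np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Algorithm-level technique: cost-sensitive learning via class weights,
# penalising errors on the minority class nine times more than on the majority.
csl_model = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)
resampled_model = LogisticRegression().fit(X_bal, y_bal)
```

Both routes pursue the same goal, rebalancing the learning signal, but at different levels: one modifies the data, the other the training objective.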
Ref. [20] presented an overview of imbalanced data in classification tasks across several domains. The authors distinguish between external solutions (which modify the data but not the algorithms) and internal solutions (algorithms or training strategies that can manage imbalance). Their review does not cover multilabel classification and is mostly focused on binary classification, which is not the only task addressed in our study. A very comprehensive article is that by [21], in which CSL, resampling, and algorithmic strategies are reviewed. The authors highlight the difficulty of setting the cost in CSL. Unfortunately, their review discussion focuses only on binary classification. However, they delineate several research directions, some of which have since been explored in the time between their review and this paper. In [22], several resampling techniques (both undersampling and oversampling) and some imbalance-robust models (boosted ensembles, among others) are presented. The review in [23] focuses on multi-class medical tasks, which are characterized by strong data imbalance; the authors concentrate on oversampling and provide a list of CSL and ensemble models capable of dealing with imbalanced data. Among the most recent contributions in this field, ref. [24] proposed a survey on oversampling (OS), undersampling (US), imbalance-robust algorithms, and CSL. This reference highlights the instability of the tested approaches and the need to carefully evaluate the effectiveness of imbalanced data management strategies case by case.
As briefly introduced, several strategies have been proposed to address class imbalance in machine learning. These methods can be broadly categorized into data-level approaches, algorithm-level approaches, and other complementary techniques.
2.2.1. Data-Level Imbalance-Handling Techniques
Ref. [25] explored undersampling, oversampling, and hybrid-sampling techniques, comparing 6 classifiers trained on 25 datasets, each characterized by a different imbalance ratio but mostly made up of few features and few samples; moreover, they focused only on binary classification. The effectiveness of SMOTE for a Parkinson’s disease monitoring solution was tested in [26]. Ref. [27] described the adaptive synthetic (ADASYN) oversampling technique as an improvement over SMOTE: instead of generating synthetic samples uniformly, ADASYN creates instances in the interior of the minority class according to a weighted distribution that emphasizes the minority samples that are harder to learn. The master’s thesis by [28] focused only on binary classification, deepening non-medical use cases by testing logistic regression (LR), support vector machine (SVM), and random forest (RF) models with SMOTE, ADASYN, or no preprocessing. Their conclusions show that no preprocessing method consistently improves the performance of the trained models. In [29], a new oversampling technique was introduced: the Mahalanobis distance-based oversampling technique (MDO). The authors compared it with SMOTE and ADASYN on 20 multiclass datasets (3–26 classes and 100–20k samples each). MDO proved to be the best solution; however, this methodology has not received attention in the literature, which is why we decided to test the more commonly used ADASYN strategy.
2.2.2. Algorithm-Level Imbalance-Handling Techniques
Ref. [30] demonstrated the effectiveness of CSL: based on the experiments performed, the authors conclude that the reliability of CSL and resampling depends on the specific characteristics of the dataset and the application context. Ref. [31] demonstrated that four different ML models reach better performance in the CSL configuration on four different medical datasets. Ref. [32] reviewed CSL for medical data and highlighted the importance of data and code sharing for this specific research topic; however, resampling was not considered in their work. Furthermore, ref. [33] defined a new CSL method and validated its efficiency.
Analyzing hybrid techniques, the effectiveness of CSL coupled with oversampling has been demonstrated for bankruptcy modeling by [34]. Ref. [35] also used oversampling together with CSL on four non-biomedical datasets; the combination of the two approaches achieved better performance than either used separately. Ref. [36] tested single and ensemble classifiers, as well as ADASYN coupled with CSL; nevertheless, they performed their experiments only on a proprietary dataset (regarding freezing of gait in Parkinson’s patients). Ref. [37] investigated the effects of feature selection in conjunction with oversampling and CSL; their experiments use six binary-classification biomedical datasets characterized by extremely high dimensionality, which are not comparable to our modeling scenario.
2.2.3. Other Imbalance-Handling Techniques
A work proposing a novel approach for binary classification is [38], in which the proposed methodology is validated on 11 datasets. Ref. [39] demonstrated the overfitting and noise resulting from the application of data-level approaches. Ref. [40] explored whether class imbalance (whose hindering effects on ML modeling are proven) also has a negative influence on DL models (multilayer perceptron and convolutional neural network). They used several datasets, some belonging to the biomedical domain, to test this hypothesis, concluding that both the MLP and the CNN suffer from data imbalance, especially when the available dataset size is limited; in particular, the MLP is highly affected, whereas the CNN is slightly less so. Ref. [41] focused on AI-based approaches for imbalanced datasets. Among modern AI-based methods, GANs are unstable and unreliable when used with limited data, while resampling techniques such as SMOTE and its variations, or ADASYN, proved either efficient or limited and inadequate, depending on the use case.
Over time, different solutions have been proposed to address imbalanced data in binary and multiclass applications, but a clear and unified strategy for successfully accomplishing this objective is lacking. Ref. [42] highlighted how a common and clear usage of resampling techniques on imbalanced datasets is missing. In that work, the datasets are resampled by the authors to create different datasets with different imbalance ratios and sizes, and the usual 70/30% train/test split is used for each experiment. The authors conclude that SMOTE and ADASYN behave similarly when the amount of data is not large and the imbalance ratio is high. Their focus is on binary classification only, but on the basis of the reviewed literature, we can reasonably expect this to hold for multiclass applications as well. Ref. [41] reviewed the literature on oversampling methodologies, summarizing the efficiency and limitations of the different approaches; the authors highlight the need for tailored strategies in specific use cases, in line with the majority of academic works.
Based on the reviewed literature, we believe that experiments based on ML modeling, without resorting to more complex and data-hungry DL techniques such as GANs, deserve to be evaluated in the context of SSC. This will provide a clearer view of this specific use case in terms of minimally invasive SSC solutions obtained under different imbalanced data management techniques.
The landscape emerging from this extended review on the management of imbalanced classification tasks is diverse. Several specific application domains have been explored over time, but most of the conclusions drawn by researchers highlight the dependency on specific characteristics of data, the instability of advanced DL-based solutions, and the consequent potential for additional research on this topic. Most of the reviewed literature agrees with the use-case specificity when evaluating the effectiveness and reliability of data-level resampling solutions and model-level cost-sensitive solutions.
3. Materials and Methods
As highlighted in the previous section, several strategies, focused on different conceptual levels of ML modeling, can be implemented to address imbalanced data classification. These range from doing nothing at all (given that imbalance is common in many practical use cases) to customizing the cost matrix used during model training, oversampling the minority classes, or selecting imbalance-robust supervised models.
In this work, we define 32 different scenarios by combining class-imbalance management strategies with different SSC tasks. Specifically, we address sleep-wake (SW) recognition, 3-SSC (wake vs. REM vs. NREM), 4-SSC (wake vs. NREM1-2 vs. NREM3 vs. REM), and 5-SSC (all the sleep stages labeled by sleep technicians). The labeling of the five sleep stages is aligned with the American Academy of Sleep Medicine (AASM) guidelines; nonetheless, the imbalance associated with each of the other tasks is peculiar and different from that associated with 5-SSC, so we also present the results achieved for those classification tasks.
We tested several strategies (each called a scenario), both simple and hybrid, on each of the aforementioned tasks. The extensive description of the experiments performed allows us to evaluate the effectiveness of each strategy in the SSC use case.
We used Python 3.9 for data import and preparation, and MATLAB 2024b for the model training and comparison phases. Hereafter, we present the data used for the experiments.
3.1. Dataset
The dataset we used for the experiments is the “Dataset for Real-time sleep stage EstimAtion using Multisensor Wearable Technology (DREAMT)” [16], which is available on the PhysioNet platform [17]. Data are accessible upon registration and the signing of a data use agreement, making it an open-access dataset. It is a novel (2024) and quite extensive repository in which the Empatica E4 wearable device (a validated biomedical device, [43]) was coupled with invasive PSG sensors to collect multiple signals, with the associated sleep stage accurately estimated by sleep technicians on the basis of the PSG for every 30-s window. Data were acquired from 100 subjects, both healthy and with disorders (mainly obstructive sleep apnea and obesity). The data acquisition protocol was fixed for all participants.
For every participant’s night, Empatica E4 collects the following:
The timestamp [s] (64 Hz);
Blood volume pulse (BVP) derived from a photoplethysmography (PPG) sensor (64 Hz);
Interbeat interval (IBI) [ms] derived from the PPG (64 Hz);
Electrodermal activity (EDA) [μS] from the galvanic skin response sensor (4 Hz);
Skin temperature [°C] from the infrared thermopile sensor (4 Hz);
Triaxial accelerometry (32 Hz);
Heart rate (HR) [bpm] estimated from the BVP signal (1 Hz).
Owing to the synchronization of the acquisition systems, these data are aligned with those associated with the PSG to allow sleep stage annotation for each 30-s window.
Additional information about the dataset can be found in the original reference; the details reported here are those essential for understanding the process followed in this study. The dataset split was performed at the epoch level, meaning that samples from the same subject may appear in both the training and testing sets. Although subject-independent evaluation protocols (e.g., leave-one-subject-out, LOSO) are commonly adopted, their application to this dataset is challenging due to severe class imbalance and the absence of specific sleep stages in some subjects, which may lead to unstable evaluation conditions.
3.2. Preprocessing
We used the aforementioned dataset to create an analytical model that is able to estimate the correct sleep stage on the basis of HR and motion only, given that these physiological signals can be reliably acquired using contactless sleep monitoring technologies, such as under-mattress sensing systems. This choice was made to emulate the type of information typically available in unobtrusive long-term sleep monitoring scenarios.
The first preprocessing step is computing the magnitude of the three acceleration signals. Moreover, we filtered out the annotated “preparation” and “missing” stages, given their lack of relevance to our main aim, i.e., the development of a reliable SSC algorithm.
Feature Extraction: For all the remaining windows, we extracted relevant and commonly adopted statistical time-domain features [44,45] from both the HR sequence and the computed motion signal. The features extracted from both HR and motion are listed in Table 1.
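As an illustration of this step, the sketch below computes a handful of common statistical time-domain features over one 30-s epoch; the feature names are illustrative examples, not the exact 12-feature set listed in Table 1, and the signals are synthetic:

```python
import numpy as np

def epoch_features(signal: np.ndarray) -> dict:
    """Illustrative statistical time-domain features for one 30-s epoch.
    These are common examples, not the paper's exact feature list."""
    return {
        "mean": signal.mean(),
        "std": signal.std(),
        "min": signal.min(),
        "max": signal.max(),
        "median": np.median(signal),
        "iqr": np.subtract(*np.percentile(signal, [75, 25])),
    }

rng = np.random.default_rng(0)
# HR is sampled at 1 Hz -> 30 samples per 30-s epoch.
hr_epoch = 60 + 5 * rng.standard_normal(30)
# Triaxial acceleration (32 Hz) is first reduced to its per-sample magnitude.
acc = rng.standard_normal((32 * 30, 3))
motion_epoch = np.linalg.norm(acc, axis=1)

features = {**{f"hr_{k}": v for k, v in epoch_features(hr_epoch).items()},
            **{f"mot_{k}": v for k, v in epoch_features(motion_epoch).items()}}
```

Repeating this over every labeled 30-s window produces one feature row per epoch, matching the tabular structure used in the experiments.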
The resulting dataset available for the subsequent experiments is composed of 80,091 samples (the overall number of 30-s windows associated with relevant sleep stages), each described by the 24 extracted features (12 from the HR signal and 12 from the motion signal).
The degree of imbalance of the minority classes depends on the SSC task addressed among SW, 3-SSC, 4-SSC, and 5-SSC. The number of samples for each sleep stage associated with every SSC task is summarized in Table 2.
This dataset has been randomly split into 80/20% training/testing for the baseline scenarios, and into 60/40% training/testing for the data resampling-based scenarios. This different proportion is related to the fact that oversampling synthesizes minority class samples, thus increasing the amount of data available for model training, while the testing amount remains fixed. Hence, retaining more data for testing is fair for rebalancing-based scenarios.
Adaptive synthetic oversampling (ADASYN): Concerning oversampling, we reviewed the literature and identified several traditional and innovative methodologies. The effectiveness and reliability of most of them have been proven in specific scenarios only, which is why we decided to test one modern yet renowned approach, adaptive synthetic (ADASYN) oversampling, which generates new minority class samples on the basis of local density. With respect to the most cited and widely used SMOTE, ADASYN is better at emphasizing instances that are hard to classify. When oversampling, it is important to choose the best balancing ratio; its optimization is highly use-case specific, and no academic reference provides a quantitative way to set the best rebalancing weight. For this reason, we defined different scenarios, each characterized by a specific percentage (25, 50, or 100%) of the gap between each minority class and the majority class to be synthesized. Practically speaking, if the gap between the NREM1 class and the majority class (i.e., NREM2) is 20,000 samples, then in the ADA25 scenario 5000 NREM1 samples are synthesized; in the ADA50 scenario, 10,000 NREM1 samples are generated; and in the ADA100 scenario the numerical gap between each minority class and the majority class is completely filled (yielding an absolutely balanced dataset after the oversampling phase). The selected ratios (25%, 50%, and 100%) therefore represent increasing levels of rebalancing, allowing us to analyze the effect of mild, moderate, and full oversampling on model performance.
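The gap-filling logic of the ADA25/ADA50/ADA100 scenarios can be sketched as follows (the class counts below are hypothetical, not the dataset's actual figures); the resulting per-class targets could then be passed, for instance, as the `sampling_strategy` dictionary of an ADASYN implementation such as the one in imbalanced-learn:

```python
def adasyn_targets(counts: dict, gap_pct: float) -> dict:
    """Target per-class sample counts when gap_pct of the gap between each
    minority class and the majority class is filled with synthetic samples."""
    majority = max(counts.values())
    return {c: n + round(gap_pct * (majority - n)) for c, n in counts.items()}

# Hypothetical 5-SSC class counts (illustrative only).
counts = {"W": 9000, "NREM1": 5000, "NREM2": 25000, "NREM3": 8000, "REM": 12000}
ada25 = adasyn_targets(counts, 0.25)   # mild rebalancing
ada100 = adasyn_targets(counts, 1.00)  # fully balanced dataset
```

With these counts, NREM1 (gap of 20,000 to NREM2) grows by 5000 samples under ADA25 and reaches full parity under ADA100, mirroring the worked example in the text.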
Customization of the misclassification cost matrix (CSL): Another strategy we explored is CSL. The misclassification costs were defined by considering the class distribution observed in the dataset. In particular, higher penalties were assigned to errors involving minority classes in proportion to their relative frequency, so that the resulting cost matrices reflect the imbalance ratio among sleep stages rather than relying solely on expert-defined values. Specifically, we adopted a basic approach that attributes the highest integer cost to the rarest minority class, a lower integer to the second-rarest minority class, and so on down to the majority class, whose cost is 1. The misclassification cost matrices used in all the _CSL scenarios for every task are presented in Table 3, Table 4, Table 5 and Table 6.
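A minimal sketch of this rank-based cost assignment follows (the class counts are hypothetical; the actual matrices are those reported in the tables above):

```python
import numpy as np

def rank_cost_matrix(counts: dict) -> np.ndarray:
    """Build a misclassification cost matrix in which the rarest class gets
    the highest integer cost and the majority class gets cost 1; correct
    classifications (the diagonal) cost 0. Rows index the true class."""
    classes = list(counts)
    # Rank classes from most to least frequent: the majority class ranks first.
    order = sorted(classes, key=lambda c: counts[c], reverse=True)
    cost = {c: order.index(c) + 1 for c in classes}
    K = len(classes)
    M = np.zeros((K, K))
    for i, true_c in enumerate(classes):
        for j in range(K):
            if i != j:
                M[i, j] = cost[true_c]  # penalty for misclassifying true_c
    return M

# Hypothetical 3-SSC counts: NREM is the majority class (cost 1),
# REM the rarest (cost 3).
M = rank_cost_matrix({"W": 9000, "NREM": 25000, "REM": 5000})
```

Making the penalty depend on the true class means every error on a rare stage weighs more during training, which is the intent of the scheme described above.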
3.3. Experimental Scenarios
The summary of the experimental scenarios defined with respect to the imbalance management techniques applied is presented hereafter. In the experiments, a compact notation is adopted to describe the different scenarios under evaluation: the term base refers to the use of the original, unbalanced dataset; the term CSL indicates that a cost matrix was applied during training; and the term ADAx indicates the use of ADASYN oversampling, with x being the percentage of the gap filled between each minority class and the majority class in the training set (x being 25, 50, or 100%). For example, in the ADA25_CSL scenario, ADASYN was applied to the training data to fill a quarter of the gap between each minority class and the majority class, and a cost matrix was then applied during training.
Specifically, for each SSC task addressed, we experiment with different strategies, as summarized below:
baseline—use normalized features extracted from the original data (80% training and 20% testing);
base_CSL—use normalized features extracted from the original data + a customized misclassification cost matrix during model training (80% training and 20% testing);
ADA100—apply ADASYN oversampling to the normalized features (on the 60% used for training) up to a completely balanced dataset, and test the resulting models on the 40% of the original data held out for testing;
ADA50—apply ADASYN oversampling to the normalized features (on the 60% used for training) filling half the gap between each minority class and the majority class, and test the resulting models on the 40% of the original data held out for testing;
ADA25—apply ADASYN oversampling to the normalized features filling a quarter of the gap between each minority class and the majority class, and test the resulting models on the 40% of the original data held out for testing;
ADA100_CSL—as ADA100, but also setting customized misclassification costs, and then testing the resulting models on the 40% of the original data held out;
ADA50_CSL—as ADA50, but also setting customized misclassification costs, and then testing the resulting models on the 40% of the original data held out;
ADA25_CSL—as ADA25, but also setting customized misclassification costs, and then testing the resulting models on the 40% of the original data held out.
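The full experimental grid implied by this list can be enumerated in a few lines (scenario names as defined above), confirming the 32 task-scenario combinations mentioned earlier:

```python
from itertools import product

tasks = ["SW", "3-SSC", "4-SSC", "5-SSC"]
scenarios = ["baseline", "base_CSL",
             "ADA100", "ADA50", "ADA25",
             "ADA100_CSL", "ADA50_CSL", "ADA25_CSL"]

# Every (task, scenario) pair defines one experiment: 4 tasks x 8 scenarios.
grid = list(product(tasks, scenarios))
```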
Within each of the experimental scenarios presented, four different models were trained and compared: decision tree (DT), k-nearest neighbour (KNN), ensemble (ENS), and deep artificial neural network (ANN). With respect to the models’ hyperparameters, every experiment trained an optimized version obtained through 30 trials of Bayesian optimization.
Owing to the proposed experimental setup, we are able to draw conclusions about several topics:
compare the effectiveness and reliability of different imbalanced data management techniques (with/without CSL, with ADASYN at varying degrees of synthesis) for every SSC task and hence under different imbalance ratios;
compare different models such as DT, KNN, and ANN, as well as the imbalance-robust ENS, under every scenario and every SSC task.
4. Results and Discussion
4.1. Performance Metrics
In line with the reviewed literature, we decided to include, together with well-known classification metrics computed for each class, such as accuracy, precision, sensitivity, specificity, and F1 score [46], other metrics that have proven robust to imbalanced classification. Specifically, we report both class-specific metrics, namely specificity, sensitivity, precision, F1 score, geometric mean, and the Matthews correlation coefficient (MCC) [47], and overall metrics, namely the imbalance accuracy metric (IAM) proposed by [48] and micro accuracy. Both the MCC and the IAM range from −1 to 1, with values close to 1 being desirable, as they indicate better classification performance. Both metrics are more suitable than the previously mentioned ones in cases such as the one analyzed in this paper, i.e., imbalanced multiclass classification: the MCC includes all the values of the confusion matrix in its formula and is sensitive to imbalance, while the IAM identifies asymmetric errors and can be adopted in unbalanced multiclass scenarios.
Table 7 summarizes the detailed descriptions of the metrics, the associated formulas, and the category to which each belongs.
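As a toy illustration of how the class-specific metrics and the MCC can be computed from a confusion matrix (using scikit-learn for the MCC; the labels and predictions are invented, and the IAM of [48], whose formula is given in Table 7, is omitted here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Illustrative predictions for a 3-class task (0 = wake, 1 = REM, 2 = NREM).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

mcc = matthews_corrcoef(y_true, y_pred)   # overall metric, in [-1, 1]
cm = confusion_matrix(y_true, y_pred)

# Per-class sensitivity and specificity derived from the confusion matrix.
tp = np.diag(cm)
fn = cm.sum(axis=1) - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - tp - fn - fp
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

Because the MCC aggregates all the cells of the confusion matrix, a high value cannot be reached by simply predicting the majority class, which is why it is preferred here over plain accuracy.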
4.2. Experimental Setup
In this work, several experiments under diverse modeling scenarios, addressing different SSC tasks, were performed. Specifically, the experiments embrace three main variability dimensions: different models, different imbalance management techniques (scenarios), and different SSC tasks. Since presenting all the results in the text could be overwhelming for the reader, visually effective bar charts and bump charts are presented hereafter; extensive tables containing all the detailed numerical results can be found in Appendix A, while the best results are described in the text.
The bump charts highlight the ranking of the different modeling scenarios for each metric, with respect to every class (horizontal axis) and for every model trained (groups on the horizontal axis), in every SSC task addressed (one chart per task). Since we consider performance metrics for which higher values are better, we assign rank 1 to the highest value (best performance) and rank 8 to the lowest.
On the other hand, the bar charts highlight the magnitude of the values and the differences among the quantities represented. The effectiveness of the different scenarios and models can thus be assessed from the bump charts and bar charts presented below.
4.3. Comparative Analysis
To obtain a comprehensive view of the experiments performed, we present bump charts of the F1 score (Figure 1), the GM (Figure 2), and the MCC (Figure 3) for every task addressed, with the aim of understanding whether a single “best model” can be identified.
These bump charts allow us to deduce some general evidence:
There is high variability in the results achieved under different SSC tasks; only the KNN model shows slightly more coherent behavior across the different classes (majority and minority) for all three considered metrics.
The considered scores and metrics remain highly consistent with one another as the number of classes increases.
Different models appear to be differently affected by the imbalance management technique (oversampling and/or CSL); in simpler tasks (binary or ternary), ANN and DT are similar to each other, whereas in more complex tasks (4-SSC and 5-SSC) every model benefits from different modeling scenarios in a completely different way.
Building on the results above, the following plots focus on some specific points of interest.
Figure 4 and
Figure 5 represent the IAM and micro accuracy values, respectively, for every model and scenario under the different tasks considered.
These allow us to draw deeper evidence with which to answer the research questions (RQs) set:
The more classes we try to detect, the worse the overall performance we obtain, as expected;
Under every modeling scenario and classification task, the ensemble is usually the best-performing model;
The superiority of the ENS over the DT, the KNN classifier, and the ANN is stable even in the more complex multiclass tasks;
By focusing on the ENS, the base scenario, together with base_CSL and ADA25, usually yields the best results.
In general, every model usually performs best under the base, base_CSL, and ADA25 scenarios, suggesting that high oversampling intensities hamper the performance of the trained models.
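A minimal sketch of combining such a data-level technique with cost-sensitive learning is given below. The exact oversampling algorithm behind the ADA scenarios is not detailed in this section, so simple random oversampling is used here as a stand-in, with the intensity (25/50/100% of the majority class size) mirroring the ADA25/ADA50/ADA100 naming, and CSL approximated via scikit-learn's `class_weight`:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Toy imbalanced data: 200 majority (class 0) vs 20 minority (class 1) samples.
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(1.5, 1, (20, 4))])
y = np.array([0] * 200 + [1] * 20)

def oversample_minority(X, y, fraction):
    """Randomly oversample each minority class until its size reaches
    `fraction` of the majority class size (stand-in for ADA100/ADA50/ADA25)."""
    counts = Counter(y)
    n_max = max(counts.values())
    parts_X, parts_y = [X], [y]
    for cls, n in counts.items():
        target = int(round(fraction * n_max))
        if n < target:
            Xc = X[y == cls]
            X_extra, y_extra = resample(Xc, np.full(len(Xc), cls),
                                        n_samples=target - n, random_state=0)
            parts_X.append(X_extra)
            parts_y.append(y_extra)
    return np.vstack(parts_X), np.concatenate(parts_y)

# "ADA25_CSL"-like scenario: oversample the minority to 25% of the majority,
# then train a cost-sensitive tree (class_weight='balanced').
X_res, y_res = oversample_minority(X, y, fraction=0.25)
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0).fit(X_res, y_res)
print(Counter(y_res))
```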
4.4. Ensemble Model Analysis
To clarify the ensemble’s superiority and dive deeper into the insights sketched above, we present heatmaps of the models’ MCC (top values in bold green, lowest values in red), crossing scenario with model for every class within each task addressed.
Figure 6,
Figure 7,
Figure 8 and
Figure 9 show the detailed results for each task addressed. The MCC proved to be the most suitable metric for imbalanced data, such as the use case considered in this study.
From the heatmaps, the greener horizontal lines make it evident once again that the ensemble model is the best-performing model for imbalanced data classification, under almost any imbalance management technique (oversampling and/or CSL) and for every class (minority or majority).
Moreover, the greener vertical columns reveal that the base and base_CSL scenarios are generally the best in terms of the MCC, followed only by the ADA25 scenario, i.e., the scenario in which only a small number of minority-class samples are synthesized. Nonetheless, the bold values (the best MCC for every class) fall in cells corresponding to either ADA100, ADA100_CSL, ADA25, base, or base_CSL.
In summary, the ensemble is always the best model (coupled with variable scenarios) except for the following:
REM (minority class in the 3-SSC task), where KNN is the best model;
NREM1 (third minority class in the 5-SSC task), where KNN is the best model;
REM (second minority class in the 5-SSC task), where KNN is the best model.
With respect to the scenarios, the landscape is more diverse:
The base is the best (coupled with either Ensemble or KNN) in both classes for the SW task, in the minority class (REM) for the 3-SSC task, and for the REM, NREM1 and wake classes in the 5-SSC task;
The base_CSL is the best (coupled with either Ensemble or KNN) in both classes for the SW task, in the minority class (REM) for the 3-SSC task, in the REM and wake classes for the 4-SSC task, and in the NREM2 and NREM3 classes for the 5-SSC task;
The ADA100 is the best (coupled with the ensemble only) in the NREM and wake classes for the 3-SSC task, in the NREM12 (majority) class for the 4-SSC task, and for the NREM2 (majority) class for the 5-SSC task;
The ADA25 is the best (coupled with the ensemble) in the NREM3 (minority) class for the 4-SSC task.
Hybrid scenarios, in which oversampling is coupled with CSL, are preferable when considering the other performance metrics (GM and F1), which can be inspected in the appendix tables.
The ensemble configuration selected by the Bayesian optimization usually relies on bagged trees. Other methods were fitted only in the following scenarios:
Adaptive Logistic Boosting is the optimal ensemble method for the SW task—base scenario;
Adaptive Boosting is the optimal ensemble method for the 3-SSC task—base_CSL and ADA100_CSL, 4-SSC task—ADA25, and 5-SSC task—ADA50 scenarios;
Gentle Adaptive Boosting is the optimal ensemble method for the SW task—ADA100, ADA50, ADA25, ADA100_CSL, ADA50_CSL, and ADA25_CSL;
Random Undersampling Boosting is the optimal ensemble method for the 3-SSC task—ADA50_CSL and ADA25_CSL, 4-SSC task—ADA50_CSL, and 5-SSC task—ADA100_CSL and ADA50_CSL.
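The method names above (bagged trees, LogitBoost-style, GentleBoost, RUSBoost) have no single scikit-learn equivalent, but the underlying comparison between a bagging ensemble and a boosting ensemble can be illustrated roughly as follows, on toy data rather than the paper's configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced 3-class problem standing in for a 3-SSC task.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

candidates = {
    "bagged_trees": BaggingClassifier(DecisionTreeClassifier(),
                                      n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Balanced accuracy is used as a simple imbalance-aware selection criterion.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="balanced_accuracy")
    print(f"{name}: {scores.mean():.3f}")
```

In the paper, this kind of method selection is carried out jointly with hyperparameter tuning by Bayesian optimization rather than exhaustive cross-validated comparison.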
To provide additional performance evidence for the best modeling scenarios outlined, we present the ROC graphs for the best model in each of the 32 experimental scenarios in
Appendix B (
Figure A1 and
Figure A2).
To further support the comparison among classifiers based on the global performance metrics reported in
Appendix A, we conducted a Wilcoxon signed-rank test across the 32 experimental scenarios. The results of this statistical analysis are summarized in
Table 8, which reports the
p-values for the pairwise comparisons between the ensemble model and the other evaluated classifiers (ANN, DT, and KNN), considering both macro accuracy and IAM. As shown in
Table 8, the ensemble model achieved statistically significant improvements over the other models in all comparisons (all
p-values < 0.001).
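A paired comparison of this kind can be sketched with `scipy.stats.wilcoxon`; the per-scenario values below are simulated for illustration, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.RandomState(42)

# Hypothetical macro-accuracy values over the 32 experimental scenarios:
# the ensemble is simulated to score consistently higher than a competitor.
ens = rng.uniform(0.60, 0.90, 32)
knn = ens - rng.uniform(0.01, 0.08, 32)   # paired, always lower

# Two-sided Wilcoxon signed-rank test on the paired per-scenario differences.
stat, p_value = wilcoxon(ens, knn)
print(f"W={stat:.1f}, p={p_value:.2e}")
```

Because the test operates on paired differences per scenario, it accounts for the fact that all classifiers were evaluated on the same 32 experimental conditions.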
4.5. Model Explainability Analysis
To provide additional insights into model behavior and support the interpretability of the proposed solutions, a global SHAP (SHapley Additive exPlanations) analysis was conducted on the overall best-performing models identified for each SSC task.
Global SHAP values were computed on a representative subset of the test set (500 samples) using the interventional approach. Shapley values were estimated via Monte Carlo sampling (500 samples) with a maximum of 128 feature subsets. Feature importance was quantified as the mean absolute SHAP value aggregated across all classes, enabling a global interpretation of model behavior.
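The Monte Carlo estimation idea can be illustrated with a small permutation-sampling sketch; this is a simplified stand-in for the shap library's procedure, using a toy linear model whose exact Shapley values are known in closed form:

```python
import numpy as np

def mc_shap(predict, x, background, n_permutations=200, rng=None):
    """Monte Carlo estimate of Shapley values for a single instance `x`.

    Features absent from a coalition are fixed to the background values
    (interventional-style); marginal contributions are averaged over
    random feature orderings."""
    rng = rng or np.random.RandomState(0)
    d = len(x)
    phi = np.zeros(d)
    for _ in range(n_permutations):
        perm = rng.permutation(d)
        z = background.astype(float).copy()
        prev = predict(z)
        for j in perm:
            z[j] = x[j]               # add feature j to the coalition
            curr = predict(z)
            phi[j] += curr - prev
            prev = curr
    return phi / n_permutations

# Toy linear model: the exact Shapley values are w * (x - background).
w = np.array([2.0, -1.0, 0.5])
predict = lambda z: float(w @ z)
x = np.array([1.0, 1.0, 1.0])
background = np.zeros(3)

phi = mc_shap(predict, x, background)
print(np.round(phi, 3))
```

Global importance, as used in the paper, is then obtained by averaging the absolute values of such per-instance estimates over the evaluation subset and across classes.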
Regarding model selection, for the 2SSC task, the optimized ensemble model under the base_CSL scenario clearly provides the best overall performance. For the 3SSC task, no single model dominates across all classes; however, the optimized ensemble under the ADA100 scenario offers the most balanced performance, particularly for NREM and wake stages, while the KNN model performs better for REM. Considering the typical trade-off between majority and minority classes, the SHAP analysis is conducted on the optimized ensemble under the ADA100 scenario. For the 4SSC task, the optimized ensemble under the base_CSL scenario achieves the best or near-best performance across all classes, particularly for REM and wake stages, while remaining competitive for NREM12 and NREM3. Similarly, for the 5SSC task, the optimized ensemble under the base_CSL scenario represents the most reliable overall solution. Although it is not the top-performing model for REM and NREM1 (where KNN performs better), it provides superior or near-optimal performance for the remaining classes, making it the most balanced choice.
The resulting SHAP-based feature importance analyses for each SSC task are presented below.
For the 2SSC task (
Figure 10a), the SHAP analysis shows that model predictions are mainly driven by variability- and range-based features derived from both motion (e.g., stdM, p2pM, rmsM) and heart rate (e.g., stdHR, p2pHR). Mean-based descriptors are not among the most relevant predictors, indicating that the discrimination between sleep and wake states primarily relies on dynamic fluctuations rather than on average signal levels.
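The named descriptors (e.g., stdHR, p2pHR, rmsM) can plausibly be computed per epoch as sketched below; the exact feature definitions used in the paper are not given in this excerpt:

```python
import numpy as np

def epoch_features(sig, prefix):
    """Variability- and range-based descriptors for one epoch of a signal
    (names follow the paper's convention, e.g. stdHR, p2pHR, rmsM)."""
    return {
        f"std{prefix}": float(np.std(sig)),                 # variability
        f"p2p{prefix}": float(np.ptp(sig)),                 # peak-to-peak range
        f"rms{prefix}": float(np.sqrt(np.mean(sig ** 2))),  # root mean square
        f"mean{prefix}": float(np.mean(sig)),               # average level
    }

hr = np.array([62.0, 64.0, 63.0, 66.0, 61.0])   # toy heart-rate samples
feats = epoch_features(hr, "HR")
print({k: round(v, 2) for k, v in feats.items()})
```

On such toy data, the SHAP finding above corresponds to the std/p2p/rms entries carrying most of the predictive signal, while the mean entry contributes little.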
For the 3SSC task (
Figure 10b), the same variability-driven behavior is observed, with both motion and heart rate features contributing significantly. However, distributional descriptors (e.g., quartiles and kurtosis) begin to emerge among the relevant predictors, and feature importance becomes more evenly distributed, reflecting the increased complexity of multi-stage classification.
For the 4SSC task (
Figure 10c), variability- and range-based features remain dominant, while distributional indicators (e.g., quartiles and median-related statistics) play a more structured role in capturing finer differences among sleep stages. In this setting, feature contributions are not uniformly shared across classes: some sleep stages (e.g., wake or NREM2) are strongly associated with the most influential predictors, whereas others are characterized by more diffuse and less distinctive patterns.
For the 5SSC task (
Figure 10d), this trend is further emphasized. Although variability-driven features still provide the largest contributions, feature importance is more widely distributed and increasingly class-dependent. Certain stages are clearly associated with the dominant predictors, while others are harder to distinguish and rely on a combination of lower-impact features, confirming the higher intrinsic difficulty of the task.
Overall, the SHAP analysis reveals a consistent progression across SSC tasks: variability-based features dominate in simpler settings, while distributional descriptors and a more distributed importance structure become increasingly relevant as task complexity grows. These findings suggest that reliable sleep stage classification with minimal and non-invasive signals is feasible, provided that their dynamic and statistical properties are adequately exploited, in line with the physiological complexity of sleep. This behavior is evident from the class-wise SHAP contributions, where dominant features show unbalanced relevance across sleep stages.
5. Conclusions
This study contributes to the field of data analytics and cognitive computing by investigating the feasibility of sleep stage classification (SSC), which can be achieved using a minimal and non-invasive set of physiological signals—heart rate and motion—combined with imbalance-aware machine learning strategies (RQ1). Through a systematic evaluation of 32 experimental scenarios across multiple classification granularities and model families on the PhysioNet DREAMT dataset, we show that ensemble-based approaches provide robust and consistent performance even under severe class imbalance conditions.
In particular, the ensemble model generally outperformed the other considered classifiers across most of the evaluated scenarios. This confirms the ensemble’s superior capability in handling complex and imbalanced multiclass settings.
Our experiments also reveal that the effectiveness of imbalance management strategies depends on the intensity of oversampling and its interaction with algorithm-level techniques. Moderate configurations, such as ADA25, and baseline configurations combined with cost-sensitive learning (base_CSL), tended to provide better performance across several tasks. Conversely, aggressive oversampling strategies (e.g., ADA100) did not consistently improve classification performance and, in some cases, led to reduced effectiveness. This suggests that excessive synthetic sample generation might negatively affect model generalization.
From a data-driven perspective, these results highlight the effectiveness of combining data-level and algorithm-level imbalance management techniques to enhance model reliability in real-world scenarios characterized by skewed class distributions (RQ2). The observed stability of ensemble models across different tasks suggests their potential suitability for cognitive computing applications, where decision robustness and generalization are critical requirements.
A further key insight emerges from the SHAP-based explainability analysis, which reveals a progressive shift in feature relevance across SSC tasks: while variability- and range-based descriptors dominate simpler settings, more complex classifications increasingly rely on distributional features and a more distributed, class-dependent contribution of predictors. This behavior reflects the intrinsic physiological complexity of sleep and highlights the importance of capturing subtle signal dynamics when moving towards finer-grained sleep stage discrimination.
Beyond methodological aspects, the proposed framework supports scalable and computationally efficient analytics for non-invasive sleep monitoring. The reliance on easily collectible signals enables long-term, continuous data acquisition and integration into large-scale monitoring pipelines (RQ1), aligning with big data paradigms and distributed cognitive systems. Such characteristics are particularly relevant for applications requiring adaptive and personalized insights derived from heterogeneous physiological data streams.
Finally, this work emphasizes the importance of adopting imbalance-aware evaluation metrics, such as the Matthews Correlation Coefficient, to avoid misleading conclusions based solely on global accuracy. Although validated in the context of SSC, the proposed experimental framework and methodological insights are broadly applicable to other cognitive computing and biomedical data analytics tasks affected by class imbalance.
Despite the promising results, some limitations should be acknowledged. The results depend on the selected feature extraction pipeline and the modeling configurations adopted in this study. Although multiple models and imbalance management strategies were systematically evaluated, alternative feature representations or hyperparameter settings could lead to different performance outcomes. Furthermore, the experiments were conducted on a single publicly available dataset (PhysioNet DREAMT), which may limit the generalizability of the findings to other populations, sensor configurations, or acquisition settings. However, the use of a publicly available dataset and clearly defined experimental scenarios supports their reproducibility. Future work will include further validation on additional datasets and cross-dataset evaluations to fully assess the generalization capabilities of the proposed methodology. Additionally, the proposed framework could be extended by investigating imbalance-aware training strategies within deep learning architectures, such as CNN or LSTM models applied to raw wearable signals. Future research will also investigate subject-independent evaluation protocols to further assess the generalizability of the proposed approach, particularly in the presence of highly imbalanced sleep stage distributions.