1. Introduction
Sleep analysis plays a crucial role in understanding physiological patterns and diagnosing sleep disorders. Physiological signals, such as heart rate (HR), activity (ACC), body temperature, or electrodermal activity (EDA), provide valuable insights into sleep stages and overall sleep quality. These sleep stages are graphically represented in a hypnogram, which illustrates the different stages of sleep—wakefulness, light sleep (N1 and N2), deep sleep (N3), and REM sleep—over the course of a sleep period, showing transitions and durations for each stage.
Accurately labeling hypnograms remains a significant challenge in sleep research and clinical practice. In fact, the reliability of the widely used American Academy of Sleep Medicine (AASM) guidelines for sleep staging has been questioned due to substantial variability in agreement between scorers [1,2,3]. Studies show that Kappa values for inter-rater agreement on sleep stages vary widely, with particularly low agreement for the N1 stage (κ values of 0.241 and 0.412). Furthermore, when scorers are from different medical centers, the Kappa value may be as low as 0.58, increasing to 0.78 when they are from the same center [2]. To improve certainty, the AASM recommends that sleep stage scoring should ideally be performed by three different experts [4]. However, due to resource constraints, this rigorous labeling process is not always feasible in practice. This, combined with unvalidated data, can lead researchers to wrong conclusions without any indication of why. This is particularly concerning in machine learning research, where data are often gathered without verification and the interpretability of results is frequently overlooked.
To address these challenges, several automated sleep staging tools have been developed. These tools aim to improve the consistency and efficiency of sleep stage labeling by leveraging machine learning algorithms and physiological signals. For example, Stephansen et al. [5] developed an automated sleep staging system that achieved human-level performance on the MASS dataset. Similarly, Fiorillo et al. [6,7] proposed a deep learning-based approach for sleep staging using EEG signals that outperformed traditional machine learning methods. Moreover, YASA [8] analyzes EEG, EOG, and EMG signals to perform sleep stage scoring, extracting sleep-related metrics such as sleep efficiency or total sleep time. While these tools show promise, they still require careful verification to ensure their reliability and accuracy.
Some methods go beyond sleep-stage labeling by assessing confidence or uncertainty, typically using EEG, EOG, or EMG signals from PSG recordings. For example, aSAGA [9] is an automated sleep-staging model that identifies ambiguous regions in the hypnogram for expert review. U-PASS [10] incorporates uncertainty estimation to flag low-confidence epochs, allowing experts to review them and improve overall annotation reliability.
In contrast, SOUP provides a novel verification framework that quantitatively evaluates the alignment between hemodynamic signals and expected physiological patterns for each sleep stage. While other tools focus primarily on signal quality and stage consistency, SOUP adds a complementary perspective, providing an additional layer of data processing that directly assesses the reliability of hypnogram labels. It enables both single-patient and cohort-level assessments, allowing systematic detection of anomalous or inconsistent sleep stage annotations across datasets, as well as identifying potential physiological anomalies or health issues in individual patients. Clinicians and researchers can use this feedback to validate hypnograms, identify sleep stages requiring further review, and ensure higher-quality sleep stage data.
Designed for scalability, the tool allows non-programmers to easily incorporate new conditions or integrate additional biomedical signals for specific research or clinical needs. Its versatility enables application at both the individual patient level—supporting daily monitoring—and the cohort level—allowing large-scale sleep stage quality assessments and uncovering systematic inconsistencies across datasets.
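As an illustration of how a verification condition of this kind could be encoded, the sketch below checks a hypothetical "[ACC] N1 > All Stages"-style rule over per-epoch activity values. The function name, stage handling, and the exclusion of Wake from the comparison are illustrative assumptions, not SOUP's actual API:

```python
import numpy as np

def check_n1_highest_activity(acc, stages):
    """Hypothetical encoding of a condition like '[ACC] N1 > All Stages':
    mean activity in N1 epochs exceeds that of every other sleep stage
    (Wake excluded here, an assumption made for illustration)."""
    acc = np.asarray(acc, dtype=float)
    stages = np.asarray(stages)
    others = [s for s in np.unique(stages) if s not in ("N1", "Wake")]
    n1_mean = acc[stages == "N1"].mean()
    return all(n1_mean > acc[stages == s].mean() for s in others)

# Toy per-epoch example: N1 shows the highest movement among sleep stages.
stages = ["Wake", "N1", "N2", "N3", "REM", "N1"]
acc = [9.0, 5.0, 1.0, 0.5, 2.0, 4.0]
print(check_n1_highest_activity(acc, stages))  # True
```

In a real deployment, each condition would be evaluated per patient and aggregated into the fulfillment percentages reported by the tool.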
The remainder of this paper is organized as follows: Section 2 begins with an overview of current research on patterns of relevant hemodynamic signals across sleep stages, followed by a presentation of the verification tool’s framework and workflow. Section 3 details the results from applying the tool to the MESA dataset, and Section 4 discusses the implications of these findings and outlines potential future directions for this research. Finally, Section 5 provides the main conclusions of the study.
3. Results
SOUP can be used for the analysis of clinical datasets and the re-evaluation of existing data for a complete analysis, including the possibility of making new diagnoses. Additionally, it is a valuable tool for daily patient management and data analysis. In this section, both use cases are demonstrated using the MESA dataset (ID 5003) [26], with access granted by the National Sleep Research Resource [27]. The objective is not to provide a new diagnosis but to showcase the functionalities of the tool, aiming to detect any deviations or error trends in the labeling and to analyze patient cohorts.
Finding a dataset that includes hypnograms, ECG, and activity data can be challenging, which makes the comprehensive nature of the MESA dataset particularly noteworthy. MESA is a longitudinal study on cardiovascular disease and its risk factors, comprising data from 6,814 participants. Among them, 2,237 underwent a Sleep Exam with polysomnography, actigraphy (ACC), ECG monitoring, and a sleep questionnaire. After excluding participants with unsynchronized activity, missing values, or non-monotonic R points, the final cohort consisted of 1,308 patients (55.20% women, 44.80% men), aged 54–94 years (69 ± 9). Moreover, for clinical dissemination, only sleep-related diseases (sleep apnea, insomnia, and restless legs) were considered; participants without sleep problems were considered healthy even if they may have had other health issues. Minimal processing was applied, primarily using the pyHRV package (version 0.4.1) [28] to extract ECG temporal and frequency components.
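As context for this preprocessing, the sketch below derives instantaneous HR from R-peak times and rejects non-monotonic R points, one of the exclusion criteria mentioned above. This is a minimal NumPy illustration, not the actual pyHRV-based pipeline:

```python
import numpy as np

def hr_from_r_peaks(r_times_s):
    """Instantaneous heart rate (bpm) from R-peak times in seconds.
    Returns None for non-monotonic R points, mirroring the cohort
    exclusion criterion (a sketch, not the actual MESA preprocessing)."""
    r = np.asarray(r_times_s, dtype=float)
    if np.any(np.diff(r) <= 0):     # non-monotonic R points -> exclude
        return None
    rr = np.diff(r)                 # RR intervals (s)
    return 60.0 / rr                # one HR value per beat, in bpm

print(hr_from_r_peaks([0.0, 0.8, 1.6, 2.4]))  # [75. 75. 75.]
print(hr_from_r_peaks([0.0, 0.8, 0.7]))       # None
```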
This dataset serves as an ideal validation platform for testing the tool’s capacity to analyze biomedical signals (HR and ACC) and hypnograms across different populations, effectively showcasing the practical applications and effectiveness of the developed tool.
3.1. Single Patient Analysis
To evaluate the tool’s functionality for single-patient analysis, a 65-year-old white woman (ID 1102) was randomly selected. To simulate the feedback for healthcare professionals, the patient’s hypnogram was deliberately modified based on the interrater disagreement ratios reported by Deng et al. [29]: N2–N3 (22.09%), Wake–N1 (19.68%), and N1–N2 (18.75%). Accordingly, a proportion of epochs in the original hypnogram was randomly reclassified, mimicking typical interrater variability in sleep stage annotations: 22.09% of N2 epochs to N3, 19.68% of Wake to N1, and 18.75% of N1 to N2. These modifications allow for assessing how discrepancies in sleep stage labeling influence the fulfillment of expected physiological conditions.
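The perturbation described above can be sketched as follows. This is a hedged reconstruction: indices are drawn from the original labels so the three swaps do not compound, and the function name and interface are illustrative:

```python
import numpy as np

def perturb_hypnogram(stages, swaps, rng=None):
    """Randomly reclassify a proportion of epochs per stage transition,
    e.g. {('N2', 'N3'): 0.2209} relabels 22.09% of N2 epochs as N3.
    A sketch of the perturbation described in the text; indices are
    taken from the original labels so swaps do not cascade."""
    rng = np.random.default_rng(rng)
    original = np.asarray(stages)
    modified = original.copy()
    for (src, dst), ratio in swaps.items():
        idx = np.flatnonzero(original == src)
        n = int(round(ratio * idx.size))
        chosen = rng.choice(idx, size=n, replace=False)
        modified[chosen] = dst
    return modified

swaps = {("N2", "N3"): 0.2209, ("Wake", "N1"): 0.1968, ("N1", "N2"): 0.1875}
hyp = np.array(["Wake"] * 100 + ["N1"] * 100 + ["N2"] * 100)
modified = perturb_hypnogram(hyp, swaps, rng=0)
print((modified != hyp).sum())  # 61 epochs relabeled (22 + 20 + 19)
```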
Table 2 presents the fulfillment percentages for the three signals in the modified hypnogram: Activity (ACC), Heart Rate in the time domain (HR Time), and Heart Rate in the frequency domain (HR Freq). Conditions associated with the N1 stage for ACC are entirely unmet (0%), whereas those for NREM and Awake are largely satisfied. In contrast, HR Time and HR Freq conditions generally demonstrate moderate to high fulfillment rates across sleep stages.
At this stage, a healthcare professional might question why no N1-specific condition is met, prompting further investigation into potential misclassifications. This is particularly relevant, as N1 is one of the most challenging sleep stages to accurately classify.
A detailed analysis of the ACC conditions in Table 3 provides insights into when each condition is met. The results suggest that awake movement patterns differ from those in N1, and the highest movement levels during sleep do not occur in the N1 stage. This discrepancy may stem from the 18.75% of N1 epochs misclassified as N2, potentially disrupting expected movement patterns.
In practice, these outputs serve as a warning for healthcare professionals to review and, if necessary, correct the N1 stage annotations. In this scenario, restoring the original labels does not impact the HR-related conditions but significantly improves the fulfillment of the ACC conditions.
Table 4 presents the compliance status of the ACC conditions after the hypnogram correction. In particular, the table shows that all conditions are now consistently met on average.
This analysis shows how the tool can identify discrepancies in sleep stage classification, providing healthcare professionals with insights to refine annotations and improve their accuracy.
3.2. Full Dataset Analysis
In this section, an analysis of the entire dataset is performed to demonstrate how to interpret the textual outputs and graphs generated by the tool. For each signal in the dataset, the various outputs produced are presented and explained to provide a comprehensive understanding of the output information.
3.2.1. Activity (ACC)
To analyze the ACC signal, Table 5 shows a partial summary of scores by patient, generated as a textual output by the tool. The data reveal an agreement score of 84.75% for the conditions, demonstrating that the studied dataset generally aligns with state-of-the-art conditions.
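One plausible way such an aggregate agreement score could be computed is as the mean fulfillment across all patient-condition pairs. This is a simplified sketch; the tool's exact weighting is not specified here:

```python
import numpy as np

def agreement_score(fulfillment):
    """Overall agreement: mean fulfillment across all patients and
    conditions, as a percentage. A sketch of how the textual summary
    score could be derived; SOUP's exact aggregation may differ."""
    return 100.0 * np.asarray(fulfillment, dtype=float).mean()

# Rows: patients; columns: ACC conditions (True = condition met).
fulfillment = np.array([
    [True, True, True, False],
    [True, True, False, True],
    [True, True, True, True],
])
print(f"{agreement_score(fulfillment):.2f}%")  # 83.33%
```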
To analyze the specific conditions fulfilled, Figure 2 presents a heatmap displaying the fulfillment of conditions. Remarkably, the graph shows that four of the conditions have a high acceptance rate, while two conditions show some disagreement: [ACC] Awake ≈ N1 and [ACC] N1 > All Stages.
These findings, supported by the textual output, indicate that the two conflicting conditions are fulfilled in at least 70% of the patients, while the remaining conditions are satisfied by at least 80%. This demonstrates a strong overall agreement of the MESA dataset with the ACC conditions.
In addition, the clustering of patients who meet the same conditions is performed. In this analysis, the framework identifies 18 distinct clusters, with Table 6 presenting a partial output of the clustering dissemination based on clinical information. The largest cluster consists of 838 patients who fully meet all conditions, reflecting the dataset’s overall trend. Within this group, 75.30% are healthy, 57.88% are female, and the average age is 70 years. In contrast, the smallest cluster includes only three individuals, where all conditions are met except for a discrepancy in [ACC] NREM < (frequent) Awake. However, a review of their clinical details revealed no clear pattern explaining this common misalignment. Additionally, the framework identifies a distinct subgroup, predominantly composed of women (8 women and 1 man), in which only two conditions are met: [ACC] NREM < (frequent) Awake and [ACC] Higher Before/After REM. These patients are mostly healthy, with an average age of 72 ± 11 years, suggesting a subgroup of women whose ACC conditions deviate from the expected patterns. Further clinical analysis may help uncover the underlying factors contributing to this variation, offering deeper insights into potential physiological or clinical differences within this group.
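The clustering step described above can be approximated by grouping patients with identical condition-fulfillment vectors. This is a simplified sketch under that assumption; the framework's actual clustering may differ:

```python
from collections import defaultdict

def cluster_by_fulfillment(patients):
    """Group patients whose condition-fulfillment vectors are identical.
    `patients` maps patient ID -> {condition name: bool}.
    A simplified sketch; SOUP's actual clustering may differ."""
    clusters = defaultdict(list)
    for pid, conditions in patients.items():
        key = tuple(sorted(conditions.items()))  # order-independent signature
        clusters[key].append(pid)
    return dict(clusters)

patients = {
    "p1": {"[ACC] Awake ≈ N1": True, "[ACC] N1 > All Stages": True},
    "p2": {"[ACC] Awake ≈ N1": True, "[ACC] N1 > All Stages": True},
    "p3": {"[ACC] Awake ≈ N1": True, "[ACC] N1 > All Stages": False},
}
clusters = cluster_by_fulfillment(patients)
print(len(clusters))  # 2 clusters: {p1, p2} and {p3}
```

Clinical variables (age, sex, diagnoses) would then be summarized per cluster to produce the dissemination tables.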
This analysis of the ACC signal shows a strong alignment of the database with state-of-the-art conditions in general terms and across all the studied sleep stages.
3.2.2. HR—Time Domain
Considering HR in the time domain, the overall score is 73.90%. Consequently, the dataset meets the state-of-the-art conditions regarding HR in the time domain with acceptable precision.
To describe another output graphic of the tool, the analysis of specific condition fulfillment begins with an assessment of the barplots (see Figure 3), which show the fulfillment of each analyzed condition. The figure shows that only two conditions are fulfilled by fewer than 70% of the patients: [HR] NREM ↓ and [HRV] NREM < REM, fulfilled by 49.31% and 65.98%, respectively.
The clinical dissemination of [HRV] NREM < REM is presented in Table 7, showing that certain groups lower the overall condition fulfillment rate. Specifically, patients with apnea have a fulfillment rate of only 63.78%. Additionally, African American and White patients register 63.55% and 63.29%, respectively. Furthermore, individuals older than 85 years exhibit the lowest fulfillment rate at 61.97%. These findings highlight specific demographics and conditions associated with lower fulfillment rates in the clinical data. Notably, healthy individuals show a score of 65.84%, lower than those with insomnia (71.64%) or restless legs syndrome (71.05%), suggesting that underlying cardiac issues may affect HRV despite the absence of sleep-related conditions.
A sensitivity analysis was performed by varying the configurable threshold for tendency conditions, which defaults to 50% and is explained in Section 2.2.2. Thresholds were tested from 10% to 90%. As shown in Figure 4, [HR] NREM ↓ drops sharply after 40%, indicating high sensitivity to stricter criteria. In contrast, [HR] REM ↑ declines more gradually, with a notable decrease only after 60%, suggesting greater robustness across the cohort. The overall score does not decrease dramatically because this parameter affects only two of the six conditions that contribute to the total score. At the default 50% threshold, the overall score was 74.12%, with [HR] NREM ↓ fulfilled in 49.31% of epochs and [HR] REM ↑ in 80.73%. These results highlight the importance of threshold selection and demonstrate the flexibility of SOUP, which allows users to adjust criteria according to cohort characteristics and study requirements.
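The threshold sweep can be sketched as follows, using synthetic per-patient fractions of trend-consistent epochs. Both the data and the simplified definition of a tendency condition here are illustrative assumptions; the actual definition is the one given in Section 2.2.2:

```python
import numpy as np

def tendency_fulfillment(trend_fractions, threshold):
    """Share of patients whose fraction of trend-consistent epochs
    reaches the threshold. A simplified sketch of the sensitivity sweep."""
    f = np.asarray(trend_fractions, dtype=float)
    return float((f >= threshold).mean())

# Hypothetical per-patient fractions of NREM epochs with decreasing HR.
rng = np.random.default_rng(42)
fractions = rng.beta(5, 5, size=1000)   # synthetic cohort, centered near 0.5
for thr in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"threshold {thr:.0%}: {tendency_fulfillment(fractions, thr):.1%} fulfilled")
```

By construction, fulfillment decreases monotonically as the threshold tightens, which is the behavior the sensitivity analysis probes.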
Temporal HR analysis confirms alignment with state-of-the-art conditions. However, specific demographic groups, such as patients with apnea and older adults, show lower scores, reflecting expected dysfunction of the autonomic nervous system.
3.2.3. HR—Frequency Domain
Analyzing HR within the frequency domain, the obtained score is 37.81%, which is considerably lower than the scores obtained for the previously examined signals.
To evaluate the fulfillment of each specific condition, Figure 5 shows the bar plot of the percentage of fulfillment for each condition. It shows that the [LF:HF] REM > NREM and [LF:HF] REM < Awake conditions are met by 70.11% and 65.90% of the patients, respectively, while the rest are met by 30% or less. This highlights a weak alignment with the state-of-the-art outcomes.
Table 8 shows the fulfillment rate of all conditions categorized by sleep stage. It is evident that the N3-related conditions are particularly problematic in this dataset, highlighting the difficulty of achieving consistent condition satisfaction during deep sleep.
Table 9 presents the clinical dissemination of the least fulfilled condition, [LF] N3 < All Stages, showing that elderly individuals are less likely to meet it than younger individuals. This reveals that age is a factor to consider when analyzing HR conditions in the frequency domain.
Analyzing the distribution of the sleep stages throughout the nights (Figure 6), it is evident that only 9.47% of the epochs are labeled as N3 sleep. According to Patel et al. [30], under optimal circumstances, 25% of sleep should be N3. Additionally, Dorffner et al. [31] noted a decrease of 1.7% per decade for males, which still exceeds the observed percentage in this dataset. This discrepancy could contribute to the mismatch among the N3-related conditions, stemming from issues within the dataset, potential labeling inaccuracies, or the rigorous nature of the condition itself.
Another possible reason for these low scores could be the poor ECG signal quality observed in some intervals. Typically, resting HR ranges between 50 and 90 bpm [32]. However, this dataset includes intervals where HR spikes to 330 bpm.
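A simple plausibility screen of the kind suggested by these ranges might look like this. The thresholds are illustrative: 50–90 bpm follows the typical resting range cited above [32], while the hard cut-off for artifacts is an assumption:

```python
import numpy as np

def flag_implausible_hr(hr_bpm, low=50, high=90, hard_max=220):
    """Flag epochs whose HR falls outside the typical resting range
    (50-90 bpm) and mark hard outliers (here >220 bpm) as likely
    signal-quality artifacts. Thresholds are illustrative assumptions."""
    hr = np.asarray(hr_bpm, dtype=float)
    atypical = (hr < low) | (hr > high)   # outside resting range
    artifact = hr > hard_max              # physiologically implausible
    return atypical, artifact

hr = np.array([62, 74, 88, 120, 330])
atypical, artifact = flag_implausible_hr(hr)
print(atypical.tolist())  # [False, False, False, True, True]
print(artifact.tolist())  # [False, False, False, False, True]
```

Flagged artifact epochs could then be excluded or weighted down before evaluating frequency-domain conditions.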
These findings highlight the complexity of interpreting N3-related patterns in real-world, community-based datasets. While the tool can detect discrepancies between expected and observed patterns, such differences should be considered in the context of cohort demographics and signal quality, rather than automatically assumed to indicate mislabeling. Decisions regarding adjustment of expected N3 conditions for older populations would require specialized clinical evaluation, which is beyond the scope of this study.
4. Discussion
In the domain of sleep signal analysis, there is a wide range of specialized tools to facilitate the investigation of different biomedical signals. To begin with, pyActigraphy [33] is designed to process actigraphy data. This type of data, commonly obtained from wearable devices, provides valuable insight into activity patterns and circadian rhythms. Another notable tool is pyHRV [28], known for its effectiveness in analyzing heart rate variability. This aspect of physiological investigation provides an understanding of sleep quality and the dynamics of the autonomic nervous system, both of which are crucial to understanding the nature of restful sleep. Moreover, in the context of automatic classification of sleep stages, the tool Counting Sheep PSG [34] can be found. Designed to be EEGLAB-compatible, this MATLAB software facilitates signal processing, visualization, event marking, and manual sleep stage scoring of polysomnography (PSG) data.
While these tools provide essential functionality for analyzing specific biomedical signals related to sleep, they fall short in offering a comprehensive framework for verifying the reliability of sleep stage annotations. Traditional tools typically focus on reclassifying sleep stages or processing signals, whereas our proposed tool, SOUP, addresses the critical need for validating the consistency of hypnogram–signal pairs against established state-of-the-art conditions. Unlike conventional tools, SOUP does not attempt to reclassify sleep stages but verifies the accuracy of provided annotations by comparing them with expected physiological patterns, such as activity and heart rate. This approach ensures the integrity and reliability of sleep stage annotations, which is crucial both for clinical practice and research. In this regard, SOUP represents a significant advancement over the state-of-the-art, filling an essential gap by systematically addressing the verification of sleep stage annotations and enhancing the reliability of sleep studies.
Beyond classical signal-processing tools, recent frameworks such as aSAGA or U-PASS introduce mechanisms to quantify uncertainty or ambiguity. As discussed in the Introduction, these methods help identify ambiguous or potentially unreliable epochs. However, their scope remains fundamentally different. In contrast to these approaches, which focus on model confidence or signal integrity, SOUP directly evaluates whether the physiological dynamics of the signals are consistent with the annotated sleep stages. This makes SOUP complementary rather than redundant: it provides a physiologically grounded validation step that current uncertainty-based or quality-metric tools do not offer.
SOUP acts as a “copilot” for sleep analysis, offering a systematic validation process for hypnogram labels. It is specifically designed to assess the alignment of sleep stage annotations with expected biomedical signal patterns. This approach makes SOUP a valuable asset for clinicians and researchers who rely on accurate sleep stage classification. Moreover, the tool is highly adaptable, capable of handling both individual and large-scale dataset analyses. It generates detailed reports that include textual summaries, tables, and graphics, making it accessible and valuable for both clinical practice and research.
From a clinical perspective, SOUP offers objective insights into sleep architecture, reducing interrater variability among sleep specialists and minimizing the time required for hypnogram generation. By enabling in-depth analyses of polysomnographic data, it enhances patient characterization, supporting more precise diagnostics and personalized interventions. Additionally, SOUP assists in identifying anomalies in sleep data, helping clinicians assess the clinical significance of deviations from established sleep patterns.
The tool is designed to be flexible, making it suitable for both research and clinical environments, independent of the source of hypnogram annotations, whether manual or automatic, or the number of annotators available. This flexibility is particularly valuable in clinical settings where only one specialist may be available for scoring, rather than the recommended three [4]. SOUP’s scalability ensures that non-programmers can integrate new conditions and signals, thus expanding its applicability to various use cases and datasets.
To demonstrate its capabilities, the tool was applied to the MESA dataset, with two use cases: (i) a single-patient analysis, where errors in sleep stage classification were identified and corrected, and (ii) a full dataset evaluation, where systematic inconsistencies in sleep stage annotations were detected. The results showed that while activity and heart rate in the time domain generally aligned with expected conditions, compliance in the frequency domain was lower, particularly in N3 sleep. This discrepancy may indicate potential mislabeling or poor signal quality. It may also reflect cohort-specific characteristics, such as the advanced age of the MESA participants, whose physiology and sleep architecture can deviate from the normative patterns on which the conditions were originally defined. These considerations further reinforce the importance of verification tools like SOUP to enhance data integrity, while also highlighting the need to account for population-specific factors when interpreting condition fulfillment.
One of the key strengths of SOUP lies in its ability to analyze multiple signals and conditions simultaneously, ensuring the accuracy and reliability of sleep stage annotations. Furthermore, clinical professionals involved in this research, specifically neurophysiologists from the Sleep Unit at Hospital Universitario de la Princesa, have recognized SOUP as a valuable tool that can enhance clinical practice. By providing objective and systematic validation of sleep stage annotations, SOUP facilitates more consistent diagnostics and enables clinicians to efficiently identify anomalies or inconsistencies in polysomnographic data, ultimately contributing to improved patient care and optimized clinical workflows. Moreover, the structured reports generated by SOUP facilitate a deeper understanding of the data and provide a robust platform for targeted investigations into inconsistencies and anomalies. Offering both clinical and research benefits, SOUP provides a novel, efficient, and effective approach to improving the accuracy and reliability of sleep stage annotations, which is essential for advancing sleep medicine and related fields.