1. Introduction
Sleep disorders are a significant public-health concern affecting millions of people worldwide and contributing to serious health problems, including cardiovascular disease, diabetes, and cognitive impairment. Beyond this clinical burden, sleep disorders also sustain a steadily growing international market for diagnostic and therapeutic interventions. They encompass multiple conditions, ranging from insomnia and narcolepsy to periodic limb movement disorder and REM sleep behavior disorder, each of which displays distinct neurophysiological patterns and clinical outcomes [
1]. Multiple factors combine to cause sleep disorders, including genetic factors, neurochemical imbalances, irregular sleep patterns, excessive screen time, and existing medical and mental health conditions [
2]. The rising global prevalence of these disorders creates urgent demand for automated diagnostic systems that provide objective measurements to complement traditional polysomnography-based evaluation.
Epidemiological evidence indicates that the prevalence and presentation of sleep disorders vary significantly with age. Insomnia prevalence increases with advancing age, affecting up to 50% of elderly populations, while narcolepsy onset typically peaks in adolescence (10–20 years) [
3]. PLM prevalence rises sharply after age 50, and RBD predominantly affects males over 60 years of age and is increasingly recognized as a prodromal marker of neurodegenerative disease [
4]. These age-related variations underscore the importance of developing classification models that are robust across diverse demographic profiles.
Traditionally, polysomnography (PSG) was the only method for accurately diagnosing sleep disorders, a process that involved the manual analysis of electroencephalography (EEG) with the help of trained sleep specialists, who identified particular sleep stages [
5]. Manual scoring of EEG signals, however, was time-consuming, subjective, and prone to inter-rater variability—hence, the need for automated classification systems [
6]. It was demonstrated that the nonstationary and complex characteristics of EEG signals limited the performance of traditional machine learning (ML) models that utilized handcrafted features [
7]. Additionally, EEG has other clinical uses, such as the detection of seizures and brain–computer interface systems [
8].
Recent progress in deep learning (DL), especially convolutional neural networks (CNNs), has provided considerable benefits for the automated analysis of EEG by enabling hierarchical feature learning directly from raw signals [
9]. CNN-based architectures have demonstrated strong performance in modeling EEG temporal patterns for automated sleep analysis tasks. While many prior studies focused on sleep stage classification, the present work instead targets sleep disorder classification, which requires distinguishing diagnostic groups rather than epoch-level sleep stages [
10]. However, individual deep learning models may struggle to fully capture complex transitional EEG dynamics and long-range temporal correlations that characterize different sleep disorder categories [
11]. To address these weaknesses, ensemble learning schemes have been proposed, in which multiple models are combined to exploit their complementary strengths and to improve generalization and robustness [
12]. At the same time, fractal analysis has proven useful for characterizing the scale-invariant temporal properties of EEG signals; MF-DFA, in particular, has been effective at quantifying long-range correlations in physiological signals [
13].
Although CNNs, ensemble learning, and fractal features have each been successful individually, their combination in a single network for classifying sleep disorders has scarcely been studied. Previous studies applied either DL or fractal analysis separately and did not exploit the possibility of integrating learned spatial-temporal representations with explicit fractal-based temporal characteristics. Moreover, although hybrid feature-integration methods have been shown to improve model robustness, the combination of learned CNN features with fractal properties for EEG classification has not been well examined. To close these gaps, a CNN-based framework was used to extract temporal features, and MF-DFA was integrated to extract fractal properties, allowing the model to capture multiscale temporal dynamics and long-range correlations that CNNs alone cannot represent. This combined methodology provides a comprehensive representation of EEG signals with enhanced classification performance across a variety of sleep disorder categories, and its person-independent design supports generalization to diverse clinical settings.
The major contributions of this study are summarized as follows:
A CNN-based approach was proposed to classify sleep disorders, including insomnia, narcolepsy, REM sleep behavior disorder, and periodic limb movement disorder, using MF-DFA-extracted features.
To analyze the impact of fractal analysis, the proposed framework was trained and evaluated on the CAP Sleep dataset, and its performance was systematically compared across settings that incorporated fractal features and those without them.
Cross-dataset analysis was conducted on the ISRUC dataset to demonstrate the robustness of the model across various clinical environments.
The rest of the paper is structured as follows:
Section 2 is the literature review.
Section 3 describes the proposed CNN-based approach with fractal features extraction and model architecture, along with the datasets and experimental setup.
Section 4 presents the results, and finally,
Section 5 provides the conclusion.
2. Literature Survey
The automation of EEG-based systems has emerged as a dynamic research field, as a wide range of traditional methods struggle to model multiscale temporal dynamics and scale-invariant structure simultaneously. Fractal EEG analysis captures long-range temporal correlations, while deep learning architectures capture localized waveform morphology. Accordingly, recent research that has explored the fusion of fractal analysis with deep learning to improve sleep-related EEG classification serves as the basis for the following literature review.
Several studies have tackled the challenge of EEG-based classification across two research areas: sleep studies and neurological research. Ding et al. [
14] developed a Dynamic Graph Attention Network for emotion recognition, achieving 94.00% accuracy on SEED but failing to generalize across different subjects. Rosenblum et al. [
15] developed a fractal slope-based method to identify sleep cycles, achieving 91–98% REM detection accuracy across 205 healthy adults, although the method did not demonstrate complete subject-independent reliability. Madhav et al. [
16] assessed three classification methods, which included ML, DL, and Neural Tangent Kernel methods for multiclass sleep disorder diagnosis using the CAP dataset. Shazid et al. [
17] compared traditional machine learning methods with hybrid CNN–BiLSTM systems on a synthetic dataset, achieving 92% accuracy, although their study used non-physiological data, which limited clinical applicability.
The study by Hu et al. [
18] developed ST-GATv2 for automated sleep-stage classification, achieving 89.0% accuracy on the MASS-SS3 dataset. Wang et al. [
19] developed MGANet, which reached 87.3% accuracy on the SHHS dataset. Chen et al. [
20] developed TS-AGCMM, which achieved 89.1% accuracy by accurately identifying N2 sleep but failed to distinguish N1 sleep. Hazra and Ghosh [
21] used ExPANet for depression detection by analyzing EEG features using fractal and complexity measurements, achieving an F1-score of 89.1%. Duan et al. [
22] used a GAT–LSTM combination to achieve 99.80% accuracy on the Bonn dataset for epilepsy detection, and Wang et al. [
23] developed Seizure-NGCLNet, which used weakly supervised methods to detect seizures, achieving an AUC of 0.988, but both methods struggled with cross-dataset generalization.
It is important to note that most of these studies focus on sleep stage classification. In contrast, the present study addresses the more clinically oriented task of sleep disorder diagnosis, which involves distinguishing pathological conditions rather than sleep architecture stages. Satapathy et al. [
24] used ISRUC data to test their dual-channel 1D CNN, which achieved sleep-specific EEG classification accuracy of 78–79%. Wadichar et al. [
3] developed a hierarchical CNN-LSTM model on the CAP dataset, achieving 93.31% accuracy in CAP Phase B testing. However, the study faced challenges because it used single-center data and did not include nonlinear features. Urtnasan et al. [
25] proved that single-lead ECG-based classification could achieve F1-scores of 95–99% through their CNN-GRU system. Kolhar et al. [
26] developed a PSO-optimized LSTM with SHAP-based interpretability, achieving 97% accuracy on CAP data, but the system lacked external validation.
Table 1 presents complete details of the studies, including their research methods, obtained results, and study limitations.
Recent progress in EEG-based classification has increasingly built on transformer-based and self-supervised learning systems. The AttnSleep [
27] and SleepTransformer [
28] models performed sleep stage classification by modeling long-range sequential dependencies via self-attention mechanisms. Self-supervised pre-training approaches that use contrastive learning on unlabeled EEG segments have proven effective in reducing the need for extensive annotated datasets, according to [
29]. These methods offer benefits for sequential modeling and label efficiency, yet they require extensive training data and substantial computing power. Moreover, they do not explicitly model the scale-invariant, multifractal behavior of sleep EEG. In contrast, the proposed MF-DFA–CNN framework captures the essential long-range temporal relationships through fractal analysis, which conventional transformers and self-supervised systems fail to represent.
Synthesis Overview
The reviewed literature reveals several converging trends and persistent gaps. Methodologically, the majority of studies used single-model frameworks employing CNNs, LSTMs, or graph attention networks to process either temporal or spatial EEG dynamics, but rarely both simultaneously. Graph-based attention methods improved spatial dependency modeling [
14,
18,
20] while convolutional architectures excelled at temporal pattern extraction [
3,
16,
17,
24]. How modern deep learning systems can be combined with fractal and nonlinear EEG descriptors remains underexplored, as existing studies [
15,
16] used shallow classifiers that could not handle large datasets.
The standard benchmarks, including MASS-SS3, ISRUC, and SHHS, enabled reproducible testing across multiple datasets. Yet these studies consistently showed limited subject-independent generalization and difficulty identifying N1-stage patterns. Reported accuracies were high, ranging from 85% to 99%, but most findings came from laboratory settings without sufficient external validation. Three main obstacles persisted: difficulties maintaining class balance, reliance on patient-specific modeling [
23], and the lack of techniques for combining features from multiple domains.
The present study uses MF-DFA-based fractal descriptors together with CNN-based temporal features to create a system that tracks both nonlinear dynamics and temporal patterns, thereby fulfilling the needs of EEG multiscale representation and cross-dataset performance. Although several reviewed studies focus on sleep stage classification, they are included here due to their methodological relevance to EEG-based modeling frameworks. The present study, however, specifically addresses sleep disorder diagnosis, which differs from sleep stage classification in its clinical objective and label structure.
3. Methodology
This study presents an end-to-end pipeline for automatic sleep disorder classification using single-channel EEG recordings with multi-domain feature extraction and hybrid deep learning to discriminate among sleep disorders, as shown in
Figure 1. The framework was designed for cross-dataset generalization, applying a unified strategy to the CAP Sleep Dataset (5 classes: Healthy, Insomnia, Narcolepsy, PLM, RBD) and the ISRUC Sleep Dataset (3 classes: Healthy, PLM, RBD). Raw EEG signals underwent filtering, normalization, and segmentation into fixed-length windows, followed by the extraction of statistical multifractal (MF-DFA) and Wavelet Scattering Network (WSN) features. A hybrid two-branch architecture integrated two complementary modalities: (1) raw EEG temporal features extracted by a one-dimensional EfficientNet-style CNN backbone (producing a 1024-dimensional embedding via dual global pooling), and (2) handcrafted multi-domain descriptors including MF-DFA generalized Hurst exponents, spectral band powers, and statistical features (34 features total) processed through a three-layer fully connected fractal branch (producing a 256-dimensional embedding). The two embeddings are concatenated (1280 dimensions) and fused via a learned attention gate (Dense(1280→320) → ReLU → Dense(320→1280) → Sigmoid, applied element-wise), which adaptively weights the contribution of each modality before the multi-layer classification head produces class probabilities. The complete architecture is described in
Section 3.4.4.
3.1. CAP Sleep Dataset
The CAP Sleep Dataset contains five diagnostic groups: Healthy, Insomnia, Narcolepsy, PLM, and RBD. We used balanced EEG segments of 2 s duration (1024 samples at 512 Hz). To ensure class balance, 9000 segments per group were retained, yielding 45,000 samples in total. Partitioning was performed at the recording (subject) level; all segments from a given recording were assigned exclusively to one of the 70% training, 15% validation, or 15% test partitions, ensuring subject-disjoint splits with no data leakage across partitions. Stratified sampling was applied at the subject level to maintain approximate class balance within each partition. The EEG frequency bands relevant to sleep disorder classification and their characteristic ranges are as follows: delta (0.5–4 Hz), which dominates during deep NREM sleep (N3); theta (4–8 Hz), prominent during light sleep (N1, N2) and associated with drowsiness; alpha (8–13 Hz), observed during relaxed wakefulness and suppressed during sleep onset; beta (13–30 Hz), associated with wakefulness and cortical arousal; and gamma (30–100 Hz), linked to cognitive processing and occasionally observed during REM sleep. Different sleep disorders exhibit characteristic EEG patterns within these bands: insomnia patients frequently show elevated beta-band activity reflecting hyperarousal, narcolepsy is characterized by rapid transitions into REM with increased theta activity, PLM is associated with periodic cortical arousals visible in delta–theta transitions, and RBD exhibits increased phasic EMG activity during REM with altered alpha–theta dynamics [
1,
5]. The balanced class distribution of the test set across all five diagnostic categories is shown in
Figure 2. All segments were filtered and normalized. When required, sampling with replacement was applied to maintain equal class counts. Our data preprocessing and balancing approach is inspired by [
3], which also used the CAP Sleep Dataset together with balanced, pre-segmented CSV files.
The five classes in the CAP dataset (Healthy, Insomnia, Narcolepsy, PLM, and RBD) correspond to clinical sleep disorder diagnoses, not to individual sleep stages (e.g., N1, N2, N3, REM, Wake) as defined by the American Academy of Sleep Medicine (AASM). Each recording in the CAP dataset is labeled according to the patient’s diagnosed sleep condition, and the classification task addressed in this study is the identification of sleep disorder categories from EEG segments, rather than the epoch-by-epoch staging of sleep architecture. This distinction is maintained consistently throughout the manuscript.
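For illustration, the subject-disjoint 70/15/15 partitioning described above can be sketched as follows; the function and variable names are illustrative, not taken from the study's code.

```python
import random

def subject_disjoint_split(segments_by_subject, train=0.70, val=0.15, seed=42):
    """Assign whole recordings (subjects) to train/val/test partitions.

    segments_by_subject: dict mapping subject id -> list of segment indices.
    Every segment of a subject lands in exactly one partition (no leakage).
    """
    subjects = sorted(segments_by_subject)
    random.Random(seed).shuffle(subjects)
    n = len(subjects)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    parts = {
        "train": subjects[:n_train],
        "val": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    # Flatten subject lists back into segment-index lists per partition.
    return {name: [seg for s in subs for seg in segments_by_subject[s]]
            for name, subs in parts.items()}
```

Stratification by diagnostic class, as used in the study, would apply the same split per class group before merging.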
3.2. ISRUC Sleep Dataset and Cross-Dataset Transfer Protocol
The ISRUC dataset consists of polysomnographic recordings from 10 subjects spanning three diagnostic categories: Healthy, PLM, and RBD. EEG channels were automatically identified (C3, C4, F3, F4, O1, O2, Cz, Pz). When multiple relevant channels were available, they were averaged to produce a single composite EEG signal per subject, ensuring compatibility with the single-channel pipeline used for CAP. The signals were divided into 1024-sample epochs at 256 Hz (4 s duration) using a 50% overlap sliding window, and flat-line segments (standard deviation < 10−6) were discarded. Minority classes were augmented via additive Gaussian noise (σ = 0.01), random amplitude scaling (factor 0.9–1.1), and temporal shifts (±50 samples).
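The minority-class augmentations listed above (additive Gaussian noise with σ = 0.01, amplitude scaling by 0.9–1.1, temporal shifts of ±50 samples) can be sketched as below; the circular-shift implementation of the temporal shift and the function name are assumptions for illustration.

```python
import numpy as np

def augment_epoch(x, rng, sigma=0.01, scale_range=(0.9, 1.1), max_shift=50):
    """Apply the three augmentations described above to one EEG epoch.

    x: 1-D array (e.g., 1024 samples). Returns a new augmented array.
    """
    y = x + rng.normal(0.0, sigma, size=x.shape)     # additive Gaussian noise
    y = y * rng.uniform(*scale_range)                # random amplitude scaling
    shift = rng.integers(-max_shift, max_shift + 1)  # temporal shift (circular here)
    return np.roll(y, shift)
```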
To evaluate cross-dataset generalization, we adopted a transfer learning protocol rather than training a new model from scratch. The CAP-trained model’s CNN backbone and fractal feature branch were frozen, preserving the feature representations learned during CAP training. Only the final classification head was replaced (from 5 outputs to 3) and fine-tuned. A small labeled ISRUC subset comprising no more than 50 segments per class was used for this head-only fine-tuning over 35 epochs; these adaptation samples were drawn exclusively from the training partition. The rationale for this protocol is threefold: (1) the CAP and ISRUC datasets differ in sampling rate (512 Hz vs. 256 Hz), recording equipment, and available diagnostic classes (5 vs. 3), necessitating minimal supervised adaptation of the decision boundary; (2) the frozen backbone ensures that the discriminative power of the transferred features, not ISRUC-specific overfitting, drives classification; and (3) the small number of adaptation samples (≤150 total) prevents the model from memorizing ISRUC-specific patterns, preserving the integrity of the cross-dataset evaluation. Data partitioning was performed at the subject level: subjects used for fine-tuning were excluded from the test set, ensuring that no subject appeared in both partitions. The held-out ISRUC test set comprised 360 segments (120 per class) from subjects not included in the fine-tuning. The detailed configuration and stratified splits for both datasets are presented in
Table 2.
3.3. Preprocessing and Normalization
The complete preprocessing pipeline is summarized as follows. First, a 4th-order Butterworth band-pass filter with cutoff frequencies of 0.3 Hz and 100 Hz was applied to remove baseline drift and high-frequency noise while retaining physiologically relevant frequency content up to the gamma band. A notch filter centered at 50 Hz (Q = 30) was subsequently applied to suppress power-line interference. For the ISRUC dataset, signals were resampled from their native sampling rate to 256 Hz using polyphase anti-aliasing filtering to ensure uniform temporal resolution; CAP signals were retained at 512 Hz.
For artifact rejection, the standard deviation of each epoch was calculated and segments with a standard deviation below 10−6 were discarded, as this threshold identified both flat-line and disconnected-electrode artifacts. Because the pipeline was designed to run fully automatically without human involvement, independent component analysis (ICA) and manual artifact rejection were avoided. Z-score normalization was then applied to each retained epoch using Equation (1), and any remaining NaN or infinite values were replaced with zeros to ensure numerical stability during feature extraction and model training.
For the CAP dataset, EEG signals were divided into 2-s segments of 1024 samples at 512 Hz. For the ISRUC dataset, continuous recordings were divided into 4-s segments of 1024 samples at 256 Hz using a sliding window with 50% overlap. Stratified sampling was used to achieve class balance, and underrepresented classes in the ISRUC dataset were augmented with additive Gaussian noise (σ = 0.01), random amplitude scaling (0.9–1.1), and temporal shifts (±50 samples).
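As a sketch of the filtering and normalization steps in this section (4th-order Butterworth band-pass at 0.3–100 Hz, 50 Hz notch with Q = 30, per-epoch z-score), using SciPy; the zero-phase (forward-backward) filtering is an assumption, since the filtering direction is not stated in the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, iirnotch, filtfilt

def preprocess_epoch(x, fs):
    """Band-pass (0.3-100 Hz, order 4), 50 Hz notch (Q=30), then z-score."""
    sos = butter(4, [0.3, 100.0], btype="band", fs=fs, output="sos")
    x = sosfiltfilt(sos, x)                      # zero-phase band-pass
    b, a = iirnotch(50.0, Q=30.0, fs=fs)
    x = filtfilt(b, a, x)                        # suppress power-line interference
    x = (x - np.mean(x)) / (np.std(x) + 1e-12)   # z-score normalization
    return np.nan_to_num(x)                      # replace NaN/Inf with zeros
```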
3.4. Feature Extraction Pipeline
A multi-domain feature extraction framework was designed to capture complementary EEG characteristics across time, frequency, and nonlinear dynamics. For each preprocessed EEG segment, spectral wavelet, multifractal, and scattering-based descriptors were computed and concatenated into a single feature vector. For CAP, an initial ~200-dimensional feature set was produced; after removing zero-variance features, mutual information-based SelectKBest was fitted exclusively on the training partition to retain the top 120 features most dependent on the class label; the identical feature mask was then applied to the validation and test sets without re-fitting, thereby preventing any test-label information leakage. For ISRUC, the pipeline produced a fixed 120-dimensional feature vector directly to maintain compatibility across datasets. It should be noted that the 120-dimensional selected feature vector served as the basis for feature importance analysis, while a compact subset of 34 key descriptors (13 MF-DFA descriptors, 8 spectral band powers, 1 Higuchi fractal dimension, and 12 statistical measures) was used as the direct input to the fractal branch of the hybrid deep learning model (
Table 3). This distinction ensures that the fractal branch receives only the most relevant and non-redundant descriptors for effective fusion with CNN-based temporal embeddings. All features were subsequently normalized using a robust scaler. Robust scaling is defined as in Equation (2):
to reduce sensitivity to outliers. NaN/Inf values were replaced with zeros before model training.
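The robust scaling of Equation (2) (subtract the per-feature median, divide by the interquartile range) can be sketched as follows, fitted on the training partition only to mirror the leakage-free protocol described above; the helper names are illustrative.

```python
import numpy as np

def fit_robust_scaler(X_train):
    """Compute per-feature median and IQR on the training partition only."""
    med = np.median(X_train, axis=0)
    q75, q25 = np.percentile(X_train, [75, 25], axis=0)
    iqr = np.where((q75 - q25) == 0, 1.0, q75 - q25)  # guard zero-variance features
    return med, iqr

def apply_robust_scaler(X, med, iqr):
    """Apply the training-set statistics to any partition (Equation (2))."""
    return np.nan_to_num((X - med) / iqr)             # NaN/Inf replaced with zeros
```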
3.4.1. Statistical and Time Domain Features
Approximately 25 time-domain features were extracted to describe the amplitude distribution and temporal variability. These included mean and median; measures of dispersion and shape; percentile markers and the interquartile range; energy measures; measures of temporal variation such as line length and zero-crossing rate; the Hjorth parameters (activity, mobility, complexity); and peak-density measures reflecting transient activity.
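A few of these time-domain descriptors (Hjorth parameters, line length, zero-crossing rate) can be sketched as follows; the exact definitions used in the study may differ slightly.

```python
import numpy as np

def time_domain_features(x):
    """A subset of the time-domain descriptors listed above."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)                                   # Hjorth activity
    mobility = np.sqrt(np.var(dx) / (np.var(x) + 1e-12))   # Hjorth mobility
    complexity = (np.sqrt(np.var(ddx) / (np.var(dx) + 1e-12))
                  / (mobility + 1e-12))                    # Hjorth complexity
    return {
        "mean": np.mean(x),
        "iqr": np.subtract(*np.percentile(x, [75, 25])),
        "line_length": np.sum(np.abs(dx)),                 # total variation proxy
        "zero_cross_rate": np.mean(np.abs(np.diff(np.sign(x))) > 0),
        "hjorth_activity": activity,
        "hjorth_mobility": mobility,
        "hjorth_complexity": complexity,
    }
```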
3.4.2. Spectral Features
Spectral features were computed from Welch’s power spectral density (PSD) estimate using a 256-sample window with 50% overlap. Absolute and relative band powers were extracted for the delta (0.5–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–100 Hz) bands. Band-power ratios included theta/alpha, delta/beta, alpha/beta, theta/beta, and delta/gamma.
Other descriptors included spectral edge frequencies of 50, 75, 90, 95, spectral entropy (seen in Equation (3)), spectral moments (peak frequency, mean/centroid frequency, variance), and spectral shape (roll-off, bandwidth, flatness). Spectral entropy was calculated as in Equation (3):
where
denotes the normalized power spectral density.
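A minimal sketch of the Welch-based band powers and spectral entropy described above, using SciPy; summing PSD bins within each band is an implementation assumption.

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 100)}

def spectral_features(x, fs):
    """Relative band powers and spectral entropy (Equation (3)) from
    Welch's PSD with a 256-sample window and 50% overlap."""
    f, psd = welch(x, fs=fs, nperseg=256, noverlap=128)
    total = np.sum(psd) + 1e-24
    feats = {}
    for name, (lo, hi) in BANDS.items():
        m = (f >= lo) & (f < hi)
        feats[f"{name}_rel"] = np.sum(psd[m]) / total
    p = psd / total                                      # normalized PSD
    feats["spectral_entropy"] = -np.sum(p * np.log2(p + 1e-24))
    feats["theta_alpha_ratio"] = feats["theta_rel"] / (feats["alpha_rel"] + 1e-24)
    return feats
```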
3.4.3. Multifractal Detrended Fluctuation Analysis (MF-DFA)
MF-DFA was used to describe the multiscale temporal variation of sleep EEG. Long-range temporal correlations in sleep EEG contain intricate variations across sleep disorder categories, which are difficult to model with conventional linear signal processing or purely data-driven deep learning models. MF-DFA offers a mathematically sound approach to the quantification of such multifractal behavior and was incorporated alongside the existing temporal measures in the proposed classification framework.
MF-DFA was applied to each EEG segment to obtain generalized Hurst exponents that characterize the level of correlation and temporal complexity in the signal. With an EEG time series x(t) of length
N, the first step of the analysis was to remove the mean of the signal and build a cumulative profile transforming the signal into a random-walk-like process to which fluctuation analysis was applicable, as shown in Equation (4):
where
is the mean of the EEG signal. This process eliminates constant trends and preconditions the signal for multiscale analysis.
The profile Y(
i) obtained was then cut into non-overlapping segments of equal length
s, with each segment being defined as shown in Equation (5):
Because the overall signal length is generally not an integer multiple of the scale s, the segmentation was repeated starting from the opposite end of the signal, so that 2Ns segments were obtained and no portion of the EEG epoch was lost. The scale parameter s was set to range between 10 and 100 to capture temporal changes across multiple levels, which was especially needed to differentiate short transient trends from the long-term correlations characteristic of various sleep disorders.
In every segment
v, least-squares linear regression was used to eliminate local trends due to slow drifts or artifacts. Detrended variance of every segment was calculated as shown in Equations (6) and (7):
for
v = 1, 2, …,
Ns, and
where
yv (i) is the fitted local trend in segment
v and
v =
Ns + 1, …, 2
Ns. The detrending stage isolated the intrinsic fluctuations of the EEG signal by removing polynomial trends that might otherwise bias the scaling behavior.
The
q-order fluctuation function was calculated as an average of the detrended variances to measure the size of fluctuations across all segments, as defined in Equation (8):
In the case of
q = 0, numerical instability was avoided by a logarithmic averaging, as shown in Equation (9):
The parameter q regulates the sensitivity of the analysis to fluctuations of different magnitudes: positive values emphasize large fluctuations, while negative values emphasize small ones. For q = 2, the analysis reduces to standard detrended fluctuation analysis (DFA).
The dependence of the fluctuation function on the scale s followed power-law behavior, as shown in Equation (10):
The generalized Hurst exponent h(q) was approximated by the slope of the log–log relationship between Fq(s) and s. The moment order q was varied from −5 to +5 (step = 0.5), and the scale s ranged from 10 to 100. The full multifractal spectrum h(q) was computed for all classes and is shown in
Figure 3B, which reveals that inter-class separation is strongest at positive q values, particularly at q = 1 where the curves for narcolepsy (h(1) = 0.58 ± 0.11) and Healthy (h(1) = 0.72 ± 0.08) are most divergent (see
Section 4.2.1 for all class-wise values). Among all q values, h(1) yielded the highest between-class effect size (Cohen’s d > 1.0 for Healthy vs. Narcolepsy) while maintaining the lowest within-class variance, making it the most stable and discriminative scalar summary of the multifractal spectrum. Accordingly, h(1) was selected as the primary MF-DFA feature for each EEG segment. This choice is consistent with established MF-DFA practice, in which h(q = 1) corresponds to the standard Hurst exponent characterizing the overall long-range correlation strength [
13,
15]. These MF-DFA representations were concatenated with CNN-based temporal embeddings and fused via an attention gating mechanism, enabling the model to jointly leverage scale-invariant fractal dynamics and learned temporal features for enhanced sleep disorder classification. The selection of h(1) over alternative multifractal descriptors—including spectrum width Δh, multi-q subsets, and the full h(q) spectrum—was empirically validated through a controlled ablation study reported in
Section 4.2.2. The two-dimensional t-SNE projection illustrating the class-wise structure and separation achieved by combined MF-DFA and CNN-based features is shown in
Figure 3.
MF-DFA parameter values were selected following established MF-DFA literature [
13,
15] and were verified empirically to maximize separation between sleep disorder groups. The complete set of MF-DFA parameter settings used in this study is summarized in
Table 4.
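The MF-DFA steps of Equations (4)–(10) can be sketched as follows; the scale and q grids default to the ranges stated above (s = 10–100, q ∈ [−5, 5]), and the function name and coarser default grids here are illustrative.

```python
import numpy as np

def mfdfa_hurst(x, scales=range(10, 101, 10), q_values=(-5, -2, 1, 2, 5)):
    """Minimal MF-DFA: cumulative profile (Eq. (4)), two-sided segmentation,
    linear detrending (Eqs. (6)-(7)), q-order fluctuation functions
    (Eqs. (8)-(9)), and h(q) from log-log slopes (Eq. (10))."""
    Y = np.cumsum(x - np.mean(x))                    # random-walk-like profile
    N = len(Y)
    logF = {q: [] for q in q_values}
    for s in scales:
        Ns = N // s
        # Segment from both ends so no part of the epoch is lost (2*Ns total).
        starts = [v * s for v in range(Ns)] + [N - (v + 1) * s for v in range(Ns)]
        t = np.arange(s)
        F2 = []
        for st in starts:
            seg = Y[st:st + s]
            coef = np.polyfit(t, seg, 1)             # local linear trend
            F2.append(np.mean((seg - np.polyval(coef, t)) ** 2))
        F2 = np.asarray(F2)
        for q in q_values:
            if q == 0:
                Fq = np.exp(0.5 * np.mean(np.log(F2)))        # Eq. (9)
            else:
                Fq = np.mean(F2 ** (q / 2.0)) ** (1.0 / q)    # Eq. (8)
            logF[q].append(np.log(Fq))
    logs = np.log(list(scales))
    # h(q): slope of log F_q(s) versus log s (Eq. (10)).
    return {q: np.polyfit(logs, logF[q], 1)[0] for q in q_values}
```

For uncorrelated noise, h(2) should be close to 0.5; persistent long-range correlations push it toward 1.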
3.4.4. CNN-Based Temporal Feature Extraction
Raw EEG signals were fed through a one-dimensional CNN to learn discriminative temporal features for sleep disorder classification. EEG signals are nonstationary and oscillatory, with transient waveforms and abrupt temporal changes that vary across sleep disorders and pathological processes. Convolutional operations over the temporal dimension allow local temporal patterns to be learned automatically from raw EEG data, removing the need for handcrafted features.
Considering a one-dimensional EEG signal
, the convolutional feature maps were calculated as shown in Equation (11):
where
wk,i,l represents the convolution kernel,
is the bias term, and C represents the input channels. Then, nonlinear activation functions were put into place to improve representational capacity, which was expressed as shown in Equation (12):
This allowed complex temporal EEG dynamics to be properly modeled.
In order to enhance training stability and generalization, batch normalization was used, which is defined as shown in Equation (13):
with
and
as the batch-wise mean and variance, respectively.
The deeper temporal representations were obtained through the convolutional layers that were stacked with residual links and formulated as shown in Equation (14):
This preserved low-level temporal information and facilitated gradient propagation.
Global average pooling was used for temporal aggregation to obtain fixed-length feature vectors, which were computed as shown in Equation (15):
where
T is the length of the time dimension following convolutional processing. The resulting CNN-based features captured oscillatory patterns, transient EEG events, and short- to medium-range temporal dependencies useful for characterizing sleep disorders. These features were further supplemented with multifractal features obtained via MF-DFA, since CNNs mostly capture localized temporal variations, whereas MF-DFA explicitly quantifies long-range temporal correlations and multiscale dynamics that are inaccessible to convolutional operations alone.
The complete architecture of the proposed hybrid CNN + MF-DFA model is summarized in
Table 3. The CNN branch (Modality 1) processes the raw 1024-sample EEG segment through a multiscale convolutional front-end (kernel sizes 7 and 31 for fine and coarse temporal resolution, respectively), three stacked residual blocks with Squeeze-and-Excitation (SE) channel recalibration (reduction ratio = 16), and dual global pooling (average + max, concatenated), producing a 1024-dimensional temporal embedding. The fractal branch (Modality 2) takes the 34-dimensional handcrafted feature vector (13 MF-DFA descriptors, 8 spectral features, 1 Higuchi fractal dimension, 12 statistical descriptors) through three Dense → BatchNorm → ReLU → Dropout layers, producing a 256-dimensional fractal embedding. Both embeddings are concatenated into a 1280-dimensional vector and passed through an attention gate: a two-layer MLP (1280 → 320 → 1280) with Sigmoid activation generates element-wise attention weights that modulate the fused representation, allowing the network to balance temporal and fractal contributions adaptively. The attended representation feeds a three-layer classification head (1280 → 512 → 256 → C, where C = 5 for CAP or 3 for ISRUC) with Softmax output.
All Conv1D layers used bias = False when followed by BatchNorm. The multiscale front-end (Stages 1a–1b) captured both fine-grained waveform morphology (kernel = 7) and broader oscillatory patterns (kernel = 31). The SE blocks performed channel-wise recalibration with a reduction ratio of 16. The fractal branch input comprised the 34 features described above: MF-DFA generalized Hurst exponents and derived descriptors (13), spectral band powers and shape features (8), the Higuchi fractal dimension (1), and statistical descriptors (12).
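The attention-gated fusion described above (concatenate the 1024-d CNN embedding with the 256-d fractal embedding, gate through a Dense(1280→320) → ReLU → Dense(320→1280) → Sigmoid MLP, then reweight element-wise) can be sketched as a single forward pass; the random weights here are placeholders standing in for trained parameters.

```python
import numpy as np

def attention_gate_fusion(cnn_emb, fractal_emb, W1, b1, W2, b2):
    """Fuse a 1024-d temporal embedding with a 256-d fractal embedding
    via the Dense(1280->320) -> ReLU -> Dense(320->1280) -> Sigmoid gate."""
    z = np.concatenate([cnn_emb, fractal_emb])       # 1280-d fused vector
    h = np.maximum(0.0, W1 @ z + b1)                 # Dense + ReLU (320-d)
    gate = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # Dense + Sigmoid (1280-d)
    return gate * z                                  # element-wise reweighting

# Placeholder weights; in the trained model these are learned parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(320, 1280)), np.zeros(320)
W2, b2 = rng.normal(scale=0.02, size=(1280, 320)), np.zeros(1280)
fused = attention_gate_fusion(rng.normal(size=1024), rng.normal(size=256),
                              W1, b1, W2, b2)
```

The gated 1280-d vector then feeds the 1280 → 512 → 256 → C classification head with Softmax output.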
3.4.5. Wavelet Scattering Network (WSN) Features
For CAP, wavelet scattering features were computed to obtain deformation-stable and translation-invariant representations. The scattering transform employed J = 6 and Q = 8 wavelets per octave, yielding first- and second-order coefficients that represented the energy distribution and cross-scale modulation. Summary statistics were calculated per order and scaled to produce a ~55-dimensional vector, which was concatenated with the other features and padded/clipped to a consistent dimension.
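The per-order summarization and pad/clip step can be sketched as follows; `summarize_scattering` and the specific statistics are illustrative assumptions, and the scattering coefficients themselves (e.g., from a J = 6, Q = 8 transform) are assumed to be computed upstream:

```python
import numpy as np

def summarize_scattering(coeffs_by_order, target_dim=55):
    """Summarize scattering coefficients per order and pad/clip to a
    fixed-length vector (hypothetical helper; statistics illustrative)."""
    feats = []
    for coeffs in coeffs_by_order:        # each order: (n_paths, n_times)
        c = np.asarray(coeffs, dtype=float)
        # per-order summary statistics over paths and time
        feats.extend([c.mean(), c.std(), np.median(c),
                      c.min(), c.max(), np.log1p(np.abs(c).mean())])
    v = np.asarray(feats)
    # standard-scale, then zero-pad or clip to the target dimension
    v = (v - v.mean()) / (v.std() + 1e-12)
    if v.size < target_dim:
        v = np.pad(v, (0, target_dim - v.size))
    return v[:target_dim]
```

The fixed `target_dim` ensures the scattering block can be concatenated with the other feature groups regardless of how many scattering paths survive filtering.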
3.5. Software and Hardware Configuration
This section describes the computational environment employed to implement, train, and evaluate the proposed sleep disorder classification framework. All experiments were conducted on a GPU-enabled system using Python-based deep learning libraries. While the CPU handled tasks such as feature extraction and data preprocessing, the GPU was leveraged to accelerate model training and optimization, ensuring efficient execution of computationally intensive operations.
The framework’s computational efficiency was evaluated on the hardware in
Table 5, with average inference times of ~12 ms per 2-s EEG epoch on an NVIDIA RTX 3090 and ~85 ms on a CPU (Intel Core i7), supporting real-time processing. The model contains 4.2 million parameters and occupies approximately 17 MB, making it appropriate for edge and clinical applications. Three essential elements remain before practical implementation: prospective clinical validation, regulatory compliance, and integration with existing clinical systems.
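The relationship between the parameter count and the reported model footprint follows from float32 storage; a quick arithmetic check under that assumption:

```python
n_params = 4.2e6            # reported parameter count
bytes_per_param = 4         # float32 weights assumed
size_mb = n_params * bytes_per_param / 1e6
# ≈ 16.8 MB, consistent with the reported ~17 MB model size
```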
4. Results and Discussion
4.1. Experimental Configuration
The proposed CNN framework with MF-DFA feature integration was tested on two independent polysomnographic datasets for automatic sleep disorder detection. The CAP Sleep Dataset encompasses five diagnostic classes: Healthy, Insomnia, Narcolepsy, PLM, and RBD, whereas the ISRUC Sleep Dataset includes three classes: Healthy, PLM, and RBD. All metrics were calculated on stratified held-out test sets.
The multi-domain features extracted by the framework included statistical measures, MF-DFA parameters, and nonlinear complexity descriptors. Training parameters were dataset-specific: CAP used a batch size of 32, five classes, a learning rate of 2 × 10⁻⁵, a weight decay of 1 × 10⁻⁵, a dropout rate of 0.35, and 250 training epochs, while ISRUC used a batch size of 32, three classes, a learning rate of 2 × 10⁻⁵, a weight decay of 1 × 10⁻⁵, and a dropout rate of 0.35, trained for 35 epochs. Both models were trained on NVIDIA GPUs.
Table 6 provides a comprehensive summary of all hyperparameters and dataset configurations for both the CAP and ISRUC datasets.
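The dataset-specific settings above can be collected in a single configuration mapping; the dictionary layout below is an illustrative convention, not the authors' actual code:

```python
# Hypothetical training-configuration mapping mirroring Table 6
CONFIGS = {
    "CAP":   dict(batch_size=32, n_classes=5, lr=2e-5,
                  weight_decay=1e-5, dropout=0.35, epochs=250),
    "ISRUC": dict(batch_size=32, n_classes=3, lr=2e-5,
                  weight_decay=1e-5, dropout=0.35, epochs=35),
}
```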
4.2. CAP Sleep Dataset Performance
On CAP, the framework achieved a test accuracy of 86.38%, with macro-averaged precision of 88.05%, recall of 88.15%, and F1-score of 88.02%. Training leveled off at epoch 240 with a validation accuracy of 88.35% and an F1-score of 88.24%. This performance is close to the inter-rater agreement range (0.70–0.85) reported for manual polysomnographic scoring. The training and validation accuracy curves across 250 epochs demonstrated stable convergence of the model, as shown in
Figure 4.
Overall classification metrics on the CAP Sleep Dataset test set achieved an accuracy of 86.38%, as shown in
Table 7. Per-class analysis results in
Table 8 show that Narcolepsy achieved a 92.96% F1-score, 90.91% precision, and 95.11% recall; Insomnia a 92.53% F1-score, 91.27% precision, and 93.82% recall; and PLM 85.04% precision, 78.31% recall, and an 81.54% F1-score. The lower F1-scores for PLM (81.54%) and RBD (83.22%) reflect the inherent EEG microarchitectural overlap among motor-related sleep disorders.
In addition to the metrics reported above, to further account for class imbalance, the following aggregate metrics were computed on the CAP test set: macro-averaged F1-score = 88.02% and Cohen’s Kappa = 0.8519. The Matthews Correlation Coefficient (MCC) was 0.8298, confirming strong multiclass discriminative performance beyond accuracy. The area under the receiver operating characteristic curve (AUC) for each class ranged from 0.964 (PLM, RBD) to 0.992 (Narcolepsy).
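Cohen's Kappa and the multiclass MCC reported above can be computed directly from a confusion matrix; a minimal NumPy sketch (using the Gorodkin generalization of MCC to the multiclass case) is:

```python
import numpy as np

def kappa_mcc(cm):
    """Cohen's kappa and multiclass MCC from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                     # observed agreement
    row, col = cm.sum(axis=1), cm.sum(axis=0)
    pe = (row @ col) / n**2                   # chance agreement
    kappa = (po - pe) / (1 - pe)
    # Gorodkin multiclass MCC
    num = np.trace(cm) * n - row @ col
    den = np.sqrt((n**2 - row @ row) * (n**2 - col @ col))
    mcc = num / den
    return kappa, mcc
```

For a binary matrix this reduces to the familiar two-class MCC formula, which makes it easy to sanity-check by hand.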
Analysis of the confusion matrix revealed that Healthy achieved a true positive rate of 93.16%, with the main confusions being 2.84% to PLM and 2.13% to RBD, as shown in
Figure 5. Insomnia reached a recall of 93.82%, and Narcolepsy the highest recall at 95.11%. The bidirectional PLM–RBD confusion rates (7.16% and 6.84%) are consistent with epidemiological findings of 30–60% PLM prevalence in patients with RBD, suggesting shared motor manifestations that should be verified by electromyography.
4.2.1. Interpretability of MF-DFA Features Across Sleep Disorder Classes
The discriminative power of MF-DFA features was evaluated via the generalized Hurst exponent h(q = 1) on the CAP test set: Healthy = 0.72 ± 0.08, Insomnia = 0.65 ± 0.09, Narcolepsy = 0.58 ± 0.11, PLM = 0.69 ± 0.10, and RBD = 0.67 ± 0.10. The lower h(1) in Narcolepsy reflects fragmented sleep, while the higher values in Healthy indicate more stable EEG dynamics, consistent with the class separation observed in the t-SNE (
Figure 3A).
Figure 3B shows the full multifractal spectrum h(q) (q = −5 to +5), revealing stronger inter-class separation at positive q values and supporting true multifractal EEG dynamics. Spectral widths further differentiated classes: Healthy and PLM showed broad spectra (higher complexity), while Narcolepsy and Insomnia showed narrow spectra (lower complexity), demonstrating that MF-DFA captures physiologically meaningful, class-specific temporal patterns complementary to CNN features.
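A minimal NumPy sketch of the MF-DFA estimate of h(q) is shown below. The scale range, first-order detrending, and forward-only windowing are simplifying assumptions rather than the paper's exact settings:

```python
import numpy as np

def mfdfa_hurst(x, qs=(1, 2), scales=None, order=1):
    """Minimal MF-DFA: profile -> windowed polynomial detrending ->
    q-order fluctuation functions -> log-log slopes = h(q)."""
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())                          # profile
    if scales is None:                                   # ~16..256 samples
        scales = np.unique(np.logspace(4, 8, 12, base=2).astype(int))
    Fq = np.zeros((len(qs), len(scales)))
    for j, s in enumerate(scales):
        n_seg = len(y) // s
        segs = y[:n_seg * s].reshape(n_seg, s)
        t = np.arange(s)
        f2 = np.empty(n_seg)
        for i, seg in enumerate(segs):                   # local detrending
            coef = np.polyfit(t, seg, order)
            f2[i] = np.mean((seg - np.polyval(coef, t)) ** 2)
        for k, q in enumerate(qs):
            if q == 0:                                   # log-average form
                Fq[k, j] = np.exp(0.5 * np.mean(np.log(f2)))
            else:
                Fq[k, j] = np.mean(f2 ** (q / 2)) ** (1 / q)
    logs = np.log(scales)
    return np.array([np.polyfit(logs, np.log(Fq[k]), 1)[0]
                     for k in range(len(qs))])
```

For uncorrelated noise the estimated exponent sits near 0.5, while persistent long-range correlations (as in the Healthy class here) push h toward higher values.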
The training dynamics of the model without MF-DFA features are shown in
Figure 6. The training loss decreased from approximately 12 to 1 over 250 epochs, while the validation loss stabilized near 1, indicating mild overfitting with preserved generalization. The model achieved a training accuracy of approximately 85.12% and a validation accuracy of 82.14%, while the F1-score improved from 0.55 to 0.8616, indicating a balance between precision and recall. The learning rate varied from 0.001 to 0.004 during training. Despite the minor overfitting, the validation results confirmed successful feature extraction and provided a reference point for assessing the added value of MF-DFA features.
The graph in
Figure 7 displays the per-class performance when MF-DFA features are not applied. Narcolepsy and Insomnia achieved performance above 90% on all evaluation criteria, whereas PLM and RBD proved more challenging, with F1-scores of approximately 80%, reflecting the difficulty of distinguishing these classes.
We employed McNemar’s test to evaluate the statistical significance of the MF-DFA contribution by comparing CAP test set predictions from the complete model with MF-DFA features against those from the model without them. The test produced χ2 = 14.73 with p = 0.0001, indicating that the performance enhancement from MF-DFA features was statistically significant at the p < 0.001 threshold. The macro F1-scores of the two configurations were 88.02% and 86.16%, a difference exceeding the 95% bootstrap confidence interval margin, confirming that the fractal features made a substantive contribution.
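The continuity-corrected McNemar statistic can be computed from the paired predictions of the two models; a small sketch, with `mcnemar_chi2` as a hypothetical helper:

```python
import numpy as np

def mcnemar_chi2(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar chi-square on paired predictions."""
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    b = int(np.sum(a_ok & ~b_ok))   # A right, B wrong (discordant)
    c = int(np.sum(~a_ok & b_ok))   # A wrong, B right (discordant)
    return (abs(b - c) - 1) ** 2 / (b + c)
```

Only the discordant pairs enter the statistic, which is why the test isolates the marginal benefit of the MF-DFA features rather than overall accuracy.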
On the CAP test set, the model achieved a macro F1-score of 88.02%, a weighted F1-score of 86.40%, and a Cohen’s Kappa of 0.8519, demonstrating a strong capacity to handle class imbalance. The multiclass classifier also achieved an MCC of 0.8298, indicating effective discrimination among classes, while class-wise AUC values ranged from 0.964 for PLM and RBD to 0.992 for Narcolepsy, as shown in
Figure 8.
4.2.2. Ablation Study: MF-DFA Feature Selection
Table 9 presents a systematic ablation across six MF-DFA feature configurations, isolating the contribution of each fractal descriptor while keeping all other pipeline components fixed. Configuration A, supplying only the generalized Hurst exponent h(1), achieved the highest accuracy (86.38 ± 0.42%) and F1-macro (88.02 ± 0.38%), confirming its role as the single most discriminative fractal feature. Removing all MF-DFA features (Configuration F) reduced accuracy to 82.14 ± 0.81% and F1-macro to 86.16 ± 0.74%, a 4.24% accuracy deficit and a 1.86% F1 deficit, quantifying the net contribution of fractal augmentation to the hybrid architecture. This degradation is statistically significant, consistent with McNemar’s test result reported in
Section 4.2 (χ2 = 14.73, p < 0.001). Substituting the multifractal spectrum width Δh as the sole fractal input (Configuration B) recovered only part of this gap (83.72% accuracy), indicating that Δh conflates positive- and negative-moment scaling behavior into a single summary statistic, thereby discarding the fine-grained temporal-correlation information encoded in h(1). Configurations C and D supplied richer moment-order subsets of three and 11 generalized Hurst exponents, respectively, yet both underperformed Configuration A (85.53% and 84.91%). The decline from C to D is consistent with a curse-of-dimensionality effect: the compact three-layer fractal branch (34 → 512 → 512 → 256) has limited capacity, and supplying 11 correlated exponents introduces redundant dimensions that dilute the gradient signal during training. Configuration E appended Δh to h(1); its accuracy (85.86 ± 0.45%) marginally exceeded the multi-q configurations but remained 0.52 points below Configuration A, confirming that spectrum width adds partially redundant rather than complementary information. These results, combined with the effect-size analysis in
Section 4.2.1, where h(1) exhibited the largest between-class Cohen’s d (>1.0 for Healthy vs. Narcolepsy) and the lowest within-class coefficient of variation (CV < 0.09), provide converging empirical and statistical justification for selecting h(1) as the sole MF-DFA descriptor in the proposed pipeline.
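Using the class means and standard deviations of h(1) reported in Section 4.2.1, the Healthy-vs.-Narcolepsy effect size can be checked with a pooled-SD Cohen's d (equal group sizes assumed for the pooling):

```python
import numpy as np

def cohens_d(m1, s1, m2, s2):
    """Cohen's d from group means and SDs, equal-n pooled SD."""
    pooled = np.sqrt((s1**2 + s2**2) / 2)
    return (m1 - m2) / pooled

# Healthy (0.72 +/- 0.08) vs. Narcolepsy (0.58 +/- 0.11)
d = cohens_d(0.72, 0.08, 0.58, 0.11)   # ≈ 1.46, consistent with d > 1.0
```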
4.3. Cross-Dataset Transfer Evaluation on the ISRUC Sleep Dataset
To assess the transferability of the representations learned on the CAP dataset, the CNN backbone and fractal branch were frozen, and only the three-class classification head was fine-tuned on a small labeled ISRUC subset (≤50 segments per class, drawn from subjects disjoint from the test set). This protocol bridges the domain gap arising from differences in sampling rate (512 Hz vs. 256 Hz), recording equipment, the number of diagnostic classes (5 vs. 3), and patient demographics, while ensuring that high-level feature representations, rather than ISRUC-specific memorization, drive classification.
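The freeze-and-fine-tune protocol can be sketched as follows; `adapt_for_transfer` and the attribute name `head` are hypothetical, but the mechanics (disable gradients everywhere, then attach a fresh trainable head sized for the target dataset's classes) mirror the described procedure:

```python
import torch.nn as nn

def adapt_for_transfer(model, in_features, n_classes):
    """Freeze every parameter, then attach a fresh trainable
    classification head for the target dataset (e.g., 3 ISRUC classes).
    Hypothetical helper mirroring the CAP -> ISRUC protocol."""
    for p in model.parameters():
        p.requires_grad = False            # frozen backbone + fractal branch
    model.head = nn.Linear(in_features, n_classes)  # new trainable head
    return [p for p in model.parameters() if p.requires_grad]
```

Passing only the returned parameter list to the optimizer guarantees that gradient updates touch the classification head alone.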
On the held-out ISRUC test set (360 segments, 120 per class, from subjects unseen during fine-tuning), the framework achieved an accuracy of 87.50%, macro-precision of 87.90%, and macro-recall of 87.50%. The macro-averaged F1-score was 87.41%, weighted F1-score 87.55%, and Cohen’s Kappa 0.8125. The Matthews Correlation Coefficient (MCC) reached 0.8137. Per-class AUC values shown in
Figure 9 were 0.967 (Healthy), 0.955 (PLM), and 0.999 (RBD), demonstrating robust class separation despite disparities in class distributions. The fact that a model trained on an entirely different dataset (CAP) achieved 87.50% accuracy on ISRUC with only classification-head fine-tuning on ≤150 labeled segments provides strong evidence that the MF-DFA and CNN features support cross-dataset generalization under limited fine-tuning. The adaptation converged at epoch 27, achieving the highest validation F1-score of 90.02%, as shown in
Figure 10.
The ISRUC cross-dataset evaluation was conducted on 360 test segments (120 per class: Healthy, PLM, RBD), derived from polysomnographic recordings of 10 subjects. To assess statistical reliability, the test set was evaluated using 1000-iteration bootstrap resampling with replacement. The resulting 95% confidence intervals for the overall accuracy were [84.72%, 90.28%], for macro F1-score [84.15%, 90.67%], and for Cohen’s Kappa [0.7688, 0.8563]. These intervals indicate that the observed cross-dataset performance was statistically robust and not attributable to sampling variability. It should be noted, however, that the relatively small sample size of 10 subjects limits generalizability claims, and validation in larger, more diverse cohorts is recommended in future work.
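The percentile-bootstrap confidence intervals can be reproduced with a short NumPy routine; `bootstrap_ci` is an illustrative helper, with the fixed seed and percentile method as assumptions:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy, resampling the test
    set with replacement (as done for the ISRUC evaluation)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    accs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample with replacement
        accs[i] = np.mean(y_true[idx] == y_pred[idx])
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The same resampled index sets can be reused to bootstrap macro F1 and Cohen's Kappa so that the three intervals are computed on identical replicates.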
Overall classification metrics on the ISRUC Sleep Dataset test set achieved an accuracy of 87.50%, as shown in
Table 10. Class-wise results indicated RBD with a 97.91% F1-score, 98.32% precision, and 97.50% recall; Healthy with an 84.50% F1-score, 78.99% precision, and 90.83% recall; and PLM with a 79.82% F1-score, 86.41% precision, and 74.17% recall, as shown in
Table 11. The strong performance on RBD suggests that ISRUC recordings contain more discriminative EEG signatures for this disorder than those in CAP.
The confusion matrix showed Healthy with 90.83% recall and 9.2% misclassified as PLM; PLM with 74.2% recall and 24.2% misclassified as Healthy; and RBD with 97.5% recall and 2.5% misclassified as PLM. The asymmetric Healthy–PLM confusion (9.2% versus 24.2%) indicated that PLM segments often resembled the EEG features of Healthy controls during non-movement periods, as shown in
Figure 11.
The ROC analysis revealed how different thresholds affected the ability to distinguish between classes. The ISRUC database showed one-vs-rest AUC results of 0.967 for Healthy, 0.955 for PLM, and 0.999 for RBD, which demonstrated high detection capabilities and low false identification rates, particularly for RBD. For the CAP dataset, the one-vs-rest ROC curves are shown in
Figure 8, with multiclass AUC values ranging from 0.964 to 0.992, demonstrating effective discrimination between disorders. The hybrid CNN-MF-DFA model outperformed standard CNNs by capturing both short-term and long-range EEG patterns for sleep disorder classification across multiple datasets.
5. Conclusions
This study presented a hybrid EEG-based framework for automated sleep disorder classification that integrated CNN-based temporal feature learning with MF-DFA-derived multifractal descriptors to capture both short-term dynamics and long-range temporal correlations characteristic of distinct disorders. Evaluated on the CAP Sleep Dataset across five diagnostic categories, the model achieved 86.38% accuracy, 88.02% macro F1-score, and a Cohen’s Kappa of 0.8519, with narcolepsy showing the highest performance (F1 = 92.96%) due to its distinctive fractal properties. An ablation study supported by McNemar’s test (
p < 0.001) confirmed the statistical significance of MF-DFA features, as their removal reduced validation accuracy to 82.14% and degraded per-class performance, underscoring their discriminative contribution. A systematic feature-selection ablation shown in
Table 9 further demonstrated that replacing h(1) with alternative multifractal descriptors consistently degraded performance, while complete removal of MF-DFA features caused a 4.24% accuracy drop, confirming that the fractal contribution is feature-specific rather than merely additive. Cross-dataset transfer evaluation on the ISRUC Sleep Dataset, where the CAP-trained feature-extraction backbone was frozen and only the classification head was fine-tuned on ≤50 labeled segments per class from disjoint subjects, yielded 87.50% accuracy (95% CI: 84.72–90.28%), demonstrating that the learned representations generalize across differing acquisition protocols, sampling rates, and patient populations. From a deployment perspective, the framework achieved an average inference time of approximately 12 ms per 2-s epoch on an NVIDIA RTX 3090 GPU and 85 ms in CPU-only mode, with a compact size of 4.2 million parameters (~17 MB), supporting real-time and edge-based applications. Despite strong overall performance, discrimination between PLM and RBD remained comparatively lower, reflecting inherent EEG microarchitectural overlap. The study did not include age-stratified analysis due to limited demographic metadata, even though age is known to influence sleep architecture and the prevalence of sleep disorders. Furthermore, prospective clinical validation, regulatory considerations, and integration with clinical systems are necessary before real-world deployment. Future research should incorporate multimodal signals, larger demographically diverse cohorts, age-stratified evaluation, and explainable AI techniques to enhance generalizability, interpretability, and clinical applicability.