1. Introduction
Deep neural networks have achieved significant advances in analyzing electroencephalographic (EEG) time series [1], ranging from brain-computer interfaces [2] to the intricacies of sleep stage scoring [3,4]. Such successes are attributed to the ability of deep neural networks, as universal function approximators, to learn properties (features) from patient data that are difficult for humans to conceptualize and define. However, training neural networks requires large and diverse datasets that capture the considerable variety between individual subjects and their medical conditions (subject heterogeneity). Creating such datasets is challenging due to the typically limited amount of data per subject (data scarcity) and the diverse measurement protocols used in different clinics, which can introduce additional variability in the data. Furthermore, acquiring large datasets is often expensive, complicated, or even intractable due to strict privacy policies and ethical guidelines. This hinders the advancement of deep neural networks for widespread application in real-world medical settings.
Efforts to mitigate the scarcity of large datasets have primarily followed two paths: (1) the development of network architectures that incorporate constraints mirroring the data’s intrinsic characteristics, such as symmetries [5], and (2) enhancing model performance with additional or cross-domain data to learn effective priors. Pertaining to the first path, a common feature in time series processing networks is the use of convolutional layers. These layers are designed to be translation-equivariant [6], which ensures that a temporal shift in the input only affects the output by the same shift. This characteristic enables consistent network responses to temporal patterns, regardless of their temporal location, while reducing the number of model parameters compared to architectures lacking such constraints. For the second path, a variety of strategies have been proposed to learn useful priors from data. One approach is data augmentation, in which time series are transformed while preserving their annotations (labels) to artificially expand the dataset [7,8]. Deep neural networks trained on such augmented datasets implicitly learn to become invariant under these transformations, which can lead to better out-of-sample prediction performance. Another strategy is transfer learning [9], a two-step process in which neural networks are trained on one task using a large dataset (pretraining step [10]) and then adapted to learn the actual task of interest using another (usually much smaller) dataset (fine-tuning step). A variant of this idea is self-supervised learning [11,12], which allows neural networks to be pretrained on large and heterogeneous datasets without explicitly labeled examples. Finally, generative models such as VAEs, GANs, and diffusion models can be used to sample new time series to extend existing datasets [13,14,15]. Such generative models approximate a data distribution and require large heterogeneous datasets for training. While all of these approaches have been demonstrated to improve the performance of neural networks, they still rely on large empirical datasets for training.
Recent advances in computer vision have demonstrated that it is possible to learn effective priors exclusively from synthetic images, which has the potential to significantly reduce the need for large empirical datasets [16,17]. Synthetic images for image classification tasks were generated by simple random processes, such as iterated function systems that produce fractals [17] or the random placement of geometric objects to cover an image canvas [16]. Deep neural networks pretrained on such data were shown to learn useful priors for image classification, achieving performance on various benchmarks comparable to pretraining on natural images [17]. This remarkable finding highlights the potential of synthetic datasets, which can be generated with little computational effort and, in principle, in unlimited amounts.
Inspired by these advances, we hypothesize that pretraining exclusively on synthetic time series data generated from simple random processes can also yield effective priors for sleep staging. Given the importance of frequencies for sleep stage scoring and other EEG-based applications [18,19], we introduce a pretraining method called “frequency pretraining” (FPT) that centers on generating synthetic time series data with specific frequency content. During pretraining, deep neural networks learn to accurately predict the frequencies present in these synthetic time series. Despite the deliberate simplicity of our synthetic data generation process and the inherent domain shift between synthetic and EEG data, we observe that FPT allows a deep neural network to detect sleep stages more accurately than fully supervised training when few samples (few-sample regime) or data from few subjects (few-subject regime) are available for fine-tuning. The success of our method underscores the essential role of frequency content in enabling neural networks to accurately and reliably discern sleep stages. We consider pretraining techniques leveraging synthetic data, like the one we propose, a promising area of research, offering the potential to develop models in sleep medicine and neuroscience that are particularly suited for scenarios involving small datasets. To facilitate testing and further advancements, we make the source code of our method publicly available [20].
Contributions
Novel synthetic pretraining approach: We introduce “frequency pretraining” (FPT), a pretraining approach using synthetic time series with random frequency content, eliminating the need for empirical EEG data during pretraining.
Demonstrated data efficiency: We demonstrate superior sleep staging performance of FPT in few-sample and few-subject regimes across three datasets, with comparable results to fully supervised methods when data is abundant.
Analysis of frequency-based priors: We evaluate the role of frequency information in our pretraining task and how synthetic sample diversity affects fine-tuning.
Comparison with self-supervised methods: We benchmark our approach against established self-supervised learning methods, demonstrating that synthetic data pretraining can achieve comparable results without requiring EEG data for pretraining.
2. Results
Our approach, illustrated in Figure 1 and detailed in Section 4, is based on a two-phase training process that combines pretraining on synthetic time series with fine-tuning on clinical sleep data. During the pretraining phase, we generate synthetic time series composed of sine waves with random frequencies drawn from predefined frequency ranges (frequency bins). These synthetic signals are then used to train a deep neural network in which the feature extractor f learns to extract useful features, while the classifier learns to predict the frequency bins from which the frequencies were drawn to generate the synthetic time series. After pretraining, the feature extractor is transferred to the fine-tuning phase, where it is applied to real EEG and electrooculography (EOG) data to classify sleep stages. In this phase, the model processes sequences of eleven consecutive sleep epochs centered on epoch i, with the feature extractor producing features from each epoch. Another classifier then aggregates these features to predict the sleep stage (Wake, N1, N2, N3, REM) for the central sleep epoch i.
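To make the data generation step concrete, the following sketch shows one way to produce such a synthetic epoch together with a multi-label frequency-bin target. The sampling rate, epoch length, bin edges, and number of sine components are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np

def make_synthetic_epoch(bin_edges, fs=100, duration=30.0, max_waves=5, rng=None):
    """Sum a few sine waves with random frequencies and return the signal
    together with a binary vector marking the frequency bins that were hit."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(0, duration, 1.0 / fs)
    signal = np.zeros_like(t)
    target = np.zeros(len(bin_edges) - 1, dtype=np.float32)
    for _ in range(rng.integers(1, max_waves + 1)):
        freq = rng.uniform(bin_edges[0], bin_edges[-1])
        phase = rng.uniform(0.0, 2.0 * np.pi)
        amplitude = rng.uniform(0.5, 1.5)
        signal += amplitude * np.sin(2.0 * np.pi * freq * t + phase)
        target[np.searchsorted(bin_edges, freq, side="right") - 1] = 1.0
    return signal.astype(np.float32), target

# Illustrative: 20 logarithmically scaled bins between 0.3 Hz and 35 Hz
bin_edges = np.geomspace(0.3, 35.0, num=21)
x, y = make_synthetic_epoch(bin_edges)
```

A network pretrained on such signal–target pairs can then be optimized with a standard multi-label objective, for instance a binary cross-entropy loss over the bin labels.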
We evaluated our approach on three publicly available datasets, DODO/H, Sleep-EDFx, and ISRUC, which provided data from 276 subjects, including both healthy individuals and those with various medical conditions (see Section 4.2.1). For sleep staging performance, we tracked the macro F1 score, with higher values indicating better classification accuracy across sleep stages. During the pretraining phase, we assessed the model’s ability to predict frequency bins using the Hamming metric and accuracy (see Section 4).
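The Hamming metric is defined in Section 4; it presumably corresponds to the average fraction of correctly predicted bin labels, which could be computed as in this minimal sketch (the 0.5 threshold on the model outputs is an assumption):

```python
import numpy as np

def hamming_score(y_true, y_prob, threshold=0.5):
    """Average fraction of frequency-bin labels predicted correctly
    (i.e., 1 - Hamming loss) for binary targets of shape (n_samples, n_bins)."""
    y_pred = (y_prob >= threshold).astype(int)
    return float(np.mean(y_true == y_pred))

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_prob = np.array([[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.4, 0.6]])
print(hamming_score(y_true, y_prob))  # 0.875: 7 of 8 labels correct
```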
2.1. Training Configurations
We compared the performance of pretrained models against the performance of non-pretrained models in scenarios with varying amounts of training data. In particular, we studied the performance of our approach in few-sample and few-subject regimes, where the greatest benefit was expected. Furthermore, we analyzed the priors that the model learned during pretraining and the role of frequency information in the learned features. Finally, we investigated whether these features could be further improved by fine-tuning the feature extractor. To enable these investigations, we created four training configurations.
Fully Supervised. The Fully Supervised training configuration is similar to many existing deep learning approaches for sleep staging [3] and served as a baseline to compare our pretrained models against. In this configuration, we skipped the pretraining step and trained (fine-tuned) the feature extractor f and the classifier from scratch using sleep staging data.
Fixed Feature Extractor. We employed the Fixed Feature Extractor configuration to investigate the relevance of the features generated by the pretrained feature extractor for sleep staging. After pretraining the feature extractor f on synthetic data, we kept its model weights and BatchNorm statistics fixed and only fine-tuned the sleep staging classifier on sleep data.
Fine-Tuned Feature Extractor. With this training configuration, we studied (i) how model performance changes when the pretrained feature extractor is allowed to change during fine-tuning and (ii) whether the priors learned during pretraining can prevent overfitting in few-sample or few-subject regimes. As in the previous configuration, we first pretrained the feature extractor f on synthetic data, but then fine-tuned both the feature extractor and the classifier on sleep data without keeping any model weights fixed. Consequently, this configuration is similar to the Fully Supervised configuration with the key distinction that the feature extractor is initialized with pretrained weights.
Untrained Feature Extractor. The Untrained Feature Extractor configuration served as a baseline to study whether our pretraining scheme produces priors that are superior to random weights for sleep staging. We randomly initialized the feature extractor f using Kaiming normal initialization [21] and then kept its weights fixed while fine-tuning the classifier. This approach mirrors the Fixed Feature Extractor configuration, but with a random feature extractor instead of a pretrained one.
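The four configurations differ mainly in how the feature extractor is initialized and whether it is updated during fine-tuning. A minimal PyTorch sketch of these choices is given below; the helper function and module names are hypothetical and only illustrate the weight freezing, BatchNorm handling, and Kaiming initialization described above.

```python
import torch.nn as nn

def setup_configuration(feature_extractor: nn.Module, classifier: nn.Module, mode: str):
    """Prepare one of the four training configurations and return the parameters
    to pass to the optimizer. Modes: 'fully_supervised', 'fixed',
    'fine_tuned', 'untrained'."""
    if mode == "untrained":
        # Random Kaiming-normal weights instead of pretrained ones
        for m in feature_extractor.modules():
            if isinstance(m, (nn.Conv1d, nn.Linear)):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
    if mode in ("fixed", "untrained"):
        # Freeze weights and keep BatchNorm running statistics fixed
        for p in feature_extractor.parameters():
            p.requires_grad = False
        feature_extractor.eval()  # re-apply after every model.train() call
        return list(classifier.parameters())
    # 'fully_supervised' (random init) and 'fine_tuned' (pretrained init)
    # update both modules during fine-tuning
    return list(feature_extractor.parameters()) + list(classifier.parameters())
```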
2.2. Data Efficiency
To assess the data efficiency of our pretraining method, we compared the performance of the different training configurations when fine-tuned with a reduced amount of training data (low-data regime) or with the full training data (high-data regime). We investigated the low-data regime by first pretraining models with the full synthetic dataset. Depending on the training configuration, we then fine-tuned the pretrained or randomly initialized models with 50 randomly sampled sleep staging samples from one subject. The sampling procedure selected sleep staging data without class stratification, but each sleep stage was represented at least once in the reduced datasets. In the high-data regime, we followed the same procedure but fine-tuned models with the full training data instead of a reduced amount. We repeated all experiments three times in a 5-fold cross-validation scheme, resulting in 15 training runs for each configuration and dataset. This approach allowed us to estimate the spread of the macro F1 scores when using different model initializations, sleep staging samples, and data folds for training and testing (see Section 4).
In the low-data regime, we observed that models pretrained with synthetic data outperformed models trained from scratch in sleep stage classification (see Figure 2). The performance gap between pretrained and non-pretrained models was most pronounced when comparing the fine-tuned feature extractor to the fully supervised configuration, with the former achieving average macro F1 scores that were 0.06–0.07 higher than those of the latter across datasets. When removing the fine-tuning of the pretrained features (fixed feature extractor configuration), our method still yielded macro F1 scores that were 0.01–0.05 higher than those of the fully supervised configuration. Comparing the fixed feature extractor to the untrained feature extractor configuration further highlights the importance of the learned features. Pretraining the feature extractor with synthetic data improved the macro F1 scores by 0.10–0.17 compared to a random initialization. While the macro F1 scores of the training configurations varied between datasets, the general trends observed in the low-data regime were consistent across all three datasets.
In the high-data regime, our pretrained models were on par with fully supervised models (p < 0.05 in a paired TOST equivalence test with a predefined margin on the DODO/H and ISRUC datasets) and achieved competitive performance in sleep stage classification (see Figure 2). The fine-tuned feature extractor configuration achieved average macro F1 scores of 0.76–0.81 across datasets, comparable to the macro F1 scores of 0.76–0.80 achieved by the fully supervised configuration. As in the low-data regime, we observed that fine-tuning the pretrained features was beneficial, as the macro F1 scores achieved by the fine-tuned feature extractor configuration were 0.02–0.05 higher than those of the fixed feature extractor configuration. The performance gap between the untrained feature extractor and the fixed feature extractor remained substantial even in the high-data regime, with the former achieving average macro F1 scores that were 0.08–0.12 lower than those of the latter.
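One common way to run such a paired TOST is as two one-sided t-tests on the per-run score differences; the sketch below assumes an arbitrary margin of 0.02 for illustration, since the margin used in the study is not reproduced here.

```python
import numpy as np
from scipy import stats

def paired_tost(scores_a, scores_b, margin):
    """Two one-sided tests (TOST): equivalence is supported when the returned
    value (the larger of the two one-sided p-values) falls below alpha."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    p_above_lower = stats.ttest_1samp(d, -margin, alternative="greater").pvalue
    p_below_upper = stats.ttest_1samp(d, margin, alternative="less").pvalue
    return max(p_above_lower, p_below_upper)

# Illustrative: 15 paired macro F1 scores per configuration, assumed margin of 0.02
rng = np.random.default_rng(42)
fine_tuned = 0.78 + 0.01 * rng.standard_normal(15)
supervised = fine_tuned - 0.003
print(paired_tost(fine_tuned, supervised, margin=0.02))
```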
2.3. Impact of Subject Diversity and Number of Training Samples
We further explored the data efficiency of our pretraining method by investigating how model performance is affected by subject diversity and sample volume (i.e., the number of samples) in the training data used for fine-tuning. To study the effect of subject diversity independently of sample volume, we separately varied the number of subjects and the number of samples in the training data. The number of subjects and the number of samples randomly drawn from those subjects were each varied over a predefined set of values, with “all” indicating that all available training samples were used. Our sampling strategy selected sleep staging data without class stratification, but each sleep stage was represented at least once in the reduced datasets. For each parameter combination, we trained the fully supervised and fine-tuned feature extractor configurations on all three datasets with three repetitions of 5-fold cross-validation.
In our results, we observed that both the fully supervised and the fine-tuned feature extractor configurations benefited from increased subject diversity, even when the total number of training samples was held constant (see rows in Figure 3a–f). Similarly, both configurations benefited from an increased number of training samples when the number of subjects was held constant (see columns in Figure 3a–f). For the considered parameter ranges, the impact of reduced subject diversity and reduced sample volume on model performance appeared to be comparable.
Similar to the observations made in Section 2.2, the fine-tuned feature extractor configuration achieved better macro F1 scores than the fully supervised configuration in the few-sample regime. The performance gap between the two configurations was most evident when the number of samples was limited to 50, with bootstrap estimates of the mean differences in macro F1 scores ranging from 0.05 to 0.09 across datasets and subject numbers (see Figure 3g–i). These differences in macro F1 scores decreased as the number of samples increased. When training on all available training samples, the fine-tuned feature extractor configuration achieved comparable or slightly better performance than the fully supervised configuration. Interestingly, the performance gap between the two configurations with all available training samples was most pronounced for the DODO/H dataset, where the fine-tuned feature extractor configuration achieved average macro F1 scores that were 0.02–0.06 higher than those of the fully supervised configuration. For the Sleep-EDFx and ISRUC datasets, the performance differences between the two configurations with all available training samples were minimal at 0.00–0.01.
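The bootstrap estimates of mean score differences mentioned above could be obtained along the following lines; this is a sketch under the assumption that runs are paired across configurations.

```python
import numpy as np

def bootstrap_mean_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap the mean difference in macro F1 between two paired sets of
    runs; returns the observed mean difference and a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    d = np.asarray(scores_a) - np.asarray(scores_b)
    idx = rng.integers(0, len(d), size=(n_boot, len(d)))
    boot_means = d[idx].mean(axis=1)
    return d.mean(), np.percentile(boot_means, [2.5, 97.5])
```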
Depending on the dataset, the fine-tuned feature extractor configuration showed varying degrees of improvement over the fully supervised configuration in the few-subject regime. When training with data from a single subject, the fine-tuned feature extractor configuration achieved average macro F1 scores that were 0.06–0.11 higher across sample numbers than those of the fully supervised configuration for the DODO/H dataset (see rows in Figure 3g). This performance gap decreased as the number of subjects increased, with mean bootstrapped differences in macro F1 scores between 0.02 and 0.09 for five subjects. In contrast, varying the number of subjects had less impact on the performance gap between the fine-tuned feature extractor and the fully supervised configurations for the Sleep-EDFx and ISRUC datasets (see rows in Figure 3h–i). The improvements achieved by the fine-tuned feature extractor configuration when trained with one subject (0.01–0.05 for the Sleep-EDFx dataset and 0.00–0.07 for the ISRUC dataset) were comparable to those achieved when trained with five subjects (0.00–0.06 for the Sleep-EDFx dataset and 0.01–0.06 for the ISRUC dataset).
2.4. Priors Towards Frequency Information
To get a better understanding of the pretraining process and the priors learned by the model, we recorded several metrics during pretraining. The recorded loss converged to a low value, indicating that the model had learned effectively (see Figure 4a). At the same time, the Hamming metric reached a high value of 0.9 on the validation data (see Figure 4b). This value can be interpreted as the model predicting the frequency bins that were used to create the synthetic signals with an accuracy of 90%. The model was particularly proficient at predicting higher frequencies, starting from 2.5 Hz (see Figure 4c), whereas lower frequencies, especially those below 1 Hz, were predicted with lower accuracy (75–85%).
We hypothesize that the differences in prediction accuracy across frequency bins are not due to the pretrained model being unable to predict lower frequencies. Instead, we believe this discrepancy arises from the varying width of the frequency bins (see x-axis of Figure 4c), as the bins for lower frequencies were narrower than those for higher frequencies due to their logarithmic scaling. While we applied a logarithmic binning scheme to increase our model’s focus on the lower frequencies important for identifying slow-wave sleep (N2 and N3 sleep), the narrow low-frequency bins increase the difficulty of the pretraining task, leading to decreased accuracies. A preliminary exploration of this trade-off between model focus and task difficulty showed that the current binning scheme slightly outperforms a linear binning scheme in the fixed feature extractor setup.
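As a concrete (assumed) example of such a binning scheme, logarithmically spaced edges can be generated in one line; note how much narrower the low-frequency bins are than their linear counterparts.

```python
import numpy as np

# Illustrative boundaries; the exact frequency range and bin count may differ.
log_edges = np.geomspace(0.3, 35.0, num=21)  # 20 logarithmically scaled bins
lin_edges = np.linspace(0.3, 35.0, num=21)   # linear alternative for comparison
print(np.round(np.diff(log_edges)[:3], 3))   # narrow bins at low frequencies
print(np.round(np.diff(lin_edges)[:3], 3))   # constant width of ~1.74 Hz
```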
Interestingly, the ability of the pretrained feature extractor to extract useful features for sleep staging was strongly influenced by the diversity of the synthetic pretraining data (see Figure 4d). We investigated this influence of sample diversity by pretraining models with varying numbers of distinct synthetic samples. To isolate the effect of sample diversity from the effect that the number of training steps has on model performance, we kept the number of gradient updates per training epoch constant by under- or oversampling the synthetic data to 100,000 samples. The pretrained models were then trained for sleep staging using data from one random subject from the DODO/H dataset and evaluated on the validation data of each cross-validation fold. As in the previous experiments, we performed three repetitions of a 5-fold cross-validation scheme for each number of synthetic samples.
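The under- or oversampling used to keep the number of gradient updates constant can be sketched as follows; the helper is hypothetical, while the fixed target of 100,000 samples follows the text.

```python
import numpy as np

def resample_to_fixed_size(dataset, n_target=100_000, seed=0):
    """Under- or oversample a list of synthetic samples to a fixed size so that
    every pretraining run performs the same number of gradient updates per epoch."""
    rng = np.random.default_rng(seed)
    replace = len(dataset) < n_target          # sample with replacement only when oversampling
    idx = rng.choice(len(dataset), size=n_target, replace=replace)
    return [dataset[i] for i in idx]
```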
Both the fixed feature extractor and the fine-tuned feature extractor configurations required a high level of diversity in the synthetic pretraining data to achieve good sleep staging performance (see Figure 4d). When pretrained with only one synthetic sample, the performance of the two training configurations with pretraining differed only slightly from the performance of the configurations without pretraining (untrained feature extractor and fully supervised configurations). As the number of synthetic samples increased beyond 100, the performance of the pretrained models improved substantially until reaching a plateau at around 10,000 samples. We hypothesize that this plateau was reached because the model had learned all the relevant features from the pretraining task, and additional samples did not provide any further benefit. The simplicity of the pretraining task could also explain the negligible performance differences between the pretrained configurations with fewer than 100 synthetic samples and the training configurations without pretraining. With so few synthetic samples, the model may have memorized the synthetic data rather than learning general features useful for sleep staging.
2.5. Comparison to Self-Supervised Methods
We compared our frequency pretraining approach (FPT) with two popular self-supervised learning (SSL) methods, SimCLR [22] and VICReg [23], using the same model architecture as in FPT. The projection head (SimCLR) and expander (VICReg) followed their original designs [22,23]. In contrast to the conceptually simple multi-label classification task in FPT, both SSL methods pursue the objective of pulling representations of data-augmented “positive” samples together while keeping representations of dissimilar “negative” samples far apart (see the original works for details [22,23]). Positive samples were generated using standard data augmentations: amplitude scaling (random factor 0.5–2), Gaussian noise injection (standard deviation 0.05), random temporal masking (10 segments of 1.5–3 s), time shifting (up to ±1.5 s), and time warping (random factor 0.67–1.5). We evaluated two pretraining configurations: pretraining on (i) synthetic data (as in FPT) and (ii) EEG recordings from the ISRUC and Sleep-EDFx datasets. Fine-tuning was performed on the DODO/H dataset under low-data (50 random epochs from one random subject) and high-data (all available training data) regimes.
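A minimal sketch of these augmentations for a single-channel signal is shown below; the parameter values follow the text, while the ordering and implementation details (e.g., zero-masking, resample-and-crop warping) are assumptions.

```python
import numpy as np

def augment(x, fs=100, rng=None):
    """Create one 'positive' view of a 1-D signal using the augmentations listed
    above. The caller's input array is not modified."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    x = x * rng.uniform(0.5, 2.0)                             # amplitude scaling
    x = x + rng.normal(0.0, 0.05, size=n)                     # Gaussian noise
    x = np.roll(x, rng.integers(-int(1.5 * fs), int(1.5 * fs) + 1))  # time shift
    for _ in range(10):                                       # temporal masking
        width = int(rng.uniform(1.5, 3.0) * fs)
        start = rng.integers(0, max(1, n - width))
        x[start:start + width] = 0.0
    factor = rng.uniform(0.67, 1.5)                           # time warping
    warped = np.interp(np.linspace(0, 1, int(n * factor)),
                       np.linspace(0, 1, n), x)
    x = warped[:n] if len(warped) >= n else np.pad(warped, (0, n - len(warped)))
    return x

# Example: two augmented views of the same epoch form a positive pair
epoch = np.random.randn(30 * 100).astype(np.float32)
view_1, view_2 = augment(epoch), augment(epoch)
```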
When pretrained on synthetic data, both SimCLR and VICReg achieved performance comparable to FPT, with FPT slightly outperforming the SSL methods in the high-data regime when the feature extractor was fine-tuned (average MF1 scores of 0.80 versus 0.81; see Table 1). Pretraining on EEG data yielded only modest improvements, with both SSL methods showing slightly better performance in the low-data regime when the feature extractor was fixed (difference in average MF1 scores of 0.02) and in the high-data regime when fine-tuning the feature extractor (difference in average MF1 scores of 0.01). Compared to the EEG-pretrained SimCLR and VICReg models in the high-data regime, FPT showed similar performance (average MF1 scores of 0.76 and 0.81) but, unlike the two SSL methods, did not require EEG data for pretraining. In the low-data regime, the SSL methods performed comparably to FPT when fine-tuning the feature extractor and achieved slightly higher scores than FPT when keeping the feature extractor fixed (average MF1 scores of 0.44 and 0.45 versus 0.42). None of the differences in MF1 scores between the SSL methods and FPT were significant according to a two-sided Wilcoxon signed-rank test (p-values > 0.05).
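The significance check could be carried out as follows; the scores are made up for illustration, and the pairing of runs across methods is an assumption.

```python
from scipy.stats import wilcoxon

# Paired per-run MF1 scores for FPT and one SSL method (illustrative numbers)
fpt = [0.802, 0.811, 0.793, 0.824, 0.801]
ssl = [0.794, 0.818, 0.799, 0.812, 0.796]
print(wilcoxon(fpt, ssl, alternative="two-sided").pvalue)
```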
3. Discussion
In this work, we propose a novel pretraining scheme for EEG time series data that leverages synthetic data generated by a simple random process. We specialized the hyperparameters of the pretraining task to the typical frequency range and distribution of sleep EEG signals and demonstrated the effectiveness of this task for sleep stage classification. Due to the availability of several open sleep staging datasets [24,25], we were able to fully control the amount and diversity of the training data, which allowed us to study the impact of our method in different data regimes. We hypothesized that our pretraining scheme would be particularly beneficial in few-sample and few-subject regimes, which we argued could benefit greatly from the priors towards frequency information that a model learns during pretraining.
Our results confirm the effectiveness of our pretraining scheme, particularly in few-sample and few-subject regimes. Pretrained models outperformed non-pretrained models when fine-tuned with a reduced number of subjects or training samples (see rows and columns in Figure 3g–i, respectively). The performance gap between pretrained and non-pretrained models was most pronounced in the few-sample regime, where pretrained models consistently achieved improvements over non-pretrained models across multiple datasets. In the few-subject regime, this performance gap was less consistent across datasets, with our pretraining method showing the most substantial improvements on the DODO/H dataset. Our findings support observations made in the field of self-supervised learning (SSL) that pretrained models generally have better data efficiency than fully supervised ones [12,26]. In contrast to SSL methods, however, our pretraining scheme improves data efficiency without requiring empirical data, while achieving comparable or only slightly reduced performance (see Table 1). Interestingly, we observed that two data-augmentation-based SSL methods performed well even when pretrained with synthetic data instead of EEG data (see Table 1), suggesting that SSL approaches are promising, yet computationally intensive, alternatives to the FPT method. Further exploration of SSL methods for pretraining on synthetic data therefore seems promising. Generating synthetic data can also be far cheaper than collecting EEG data, as generating 10,000 samples of the synthetic data used in this study required only a few seconds on a single consumer-grade CPU core (Lenovo Legion S7 16IAH7 laptop with an Intel® Core™ i7-12700H CPU).
We hypothesize that the potential of synthetic data stems from the priors that the model learns during pretraining. These priors could prevent overfitting to a small number of training samples, particularly those from minority classes (e.g., N1 sleep), or to subject-specific features, which is especially problematic in situations with very little training data. As expected, we observed that all of our training configurations improved with a larger training dataset (see Figure 2). This aligns with the prevalent view in the literature that deep learning models for sleep staging need substantial amounts of diverse data to perform well [3,4,27,28]. When trained with the full training data, pretrained models performed comparably to fully supervised models (see Figure 2), achieving macro F1 scores similar to those of other deep learning approaches for sleep staging [3,29]. In conclusion, our pretraining method was most beneficial in situations with limited training data, where it outperformed models trained from scratch, but had less impact in situations with large amounts of training data.
We further observed that, while the frequency content of a signal is crucial for sleep staging, deep neural networks extract additional information from the data that goes beyond the frequency domain. The importance of the frequency content for sleep staging is demonstrated by the high macro F1 scores achieved by the fixed feature extractor configuration and the substantial performance improvements it achieved over untrained feature extractors (see Figure 2). We attribute the performance gap between the two training configurations to the priors learned while pretraining the feature extractor. These priors biased the model to extract frequency information from the data, which it achieved with high accuracy after pretraining (see Figure 4b,c). Our finding is consistent with previous studies that reported frequency-based features to be important for sleep staging [19]. When the feature extractor of our model was allowed to be fine-tuned after pretraining, model performance increased (see Figure 2). We hypothesize that this increase in the macro F1 score is due to the feature extractor learning to extract information beyond the frequency content of the signal during fine-tuning. This hypothesis is in line with the AASM annotation guidelines [18], which consider several frequency-unrelated features essential for sleep staging. These include time-domain information, which is important for sleep spindles, as well as amplitudes and specific patterns such as K-complexes [4]. Recent studies that applied feature engineering approaches to sleep staging further support our hypothesis by including additional time-domain features in their models [30,31]. In their work, Vallat and Walker analyzed the most important features for their model and found that time-related features, such as the time elapsed since the beginning of a recording, were among the top 20 most important features [30]. Although it remains unclear what additional information our pretrained models learn during fine-tuning, our method offers a promising avenue for future research into the interpretability of deep neural networks for sleep staging.
There are several opportunities for future work that could build upon our findings. One promising direction is to explore the pretraining task in more detail, for example, by investigating the synthetic data generation process and the impact of changing the frequency range used. Similar to previous work in the vision domain [16], it could also be promising to investigate which structural properties of synthetic time series are important for sleep staging. This could be achieved by defining new pretraining tasks based on different data generation processes that incorporate more complex structures, such as desynchronized phases across channels, noise, or polymorphic amplitude variations. Exploring such data generation processes may lead to a better understanding of what constitutes “natural” EEG time series and what information, besides the frequency content, is essential for sleep staging. In addition, we suggest exploring models with greater capacity and less inductive bias than the CNN-based architecture used in this work, such as transformer models [32,33], which we expect to benefit even more from our pretraining method. Pretraining such models with synthetic data may alleviate their need for large amounts of training data [34]. Another avenue for future research is to investigate whether our pretraining method is beneficial for specific cohorts of subjects, such as patients with a particular disorder or specific age groups. Although we investigated datasets with different demographics in this work, we did not perform detailed analyses of the impact of these demographics on model performance. Finally, it could be insightful to compare our approach with a broader range of SSL methods [11] and data augmentation strategies that employ synthetic EEG generators [7,13]. To enable such comparisons and to facilitate future research in this direction, we make our code available online [20].
Our method presents a novel solution to address important issues that affect current deep learning models in the EEG time series domain, without requiring large amounts of patient data. We expect our approach to be advantageous in various applications where EEG data is scarce or derived from a limited number of subjects, such as brain–computer interfaces [2] or neurological disorder detection [35].