1. Introduction
Sleep disorders represent an increasingly significant public health concern, with substantial impacts on both physical and mental well-being. Accurate diagnosis generally requires overnight monitoring in specialized facilities through polysomnography (PSG), which, although regarded as the gold standard, is resource-intensive, invasive, and often uncomfortable for patients. The rising demand for sleep assessments has placed significant pressure on healthcare systems, particularly in hospital settings, where bed availability and staffing resources are limited.
In contrast, cost-effective, off-the-shelf devices can be comfortably worn or installed at home, enabling the collection of large volumes of data suitable for the remote monitoring of users’ health conditions. The fundamental challenge resides in extracting meaningful information from these data while addressing critical aspects such as detection accuracy, data availability and privacy for model training, computational performance, and overall cost.
This study presents a preliminary investigation into the use of a non-invasive ballistocardiography (BCG) device for detecting sleep disorders. BCG technology enables contactless monitoring of cardiac and respiratory activity by capturing mechanical vibrations transmitted through the body, thus offering a promising alternative for both continuous and event-based sleep monitoring. The primary objective is to evaluate the feasibility of employing such a device for home-based sleep monitoring, which could substantially reduce the workload of clinical facilities by facilitating remote assessment and early diagnosis. This approach not only enhances patient comfort by removing the need for repeated hospital visits, but also contributes to more efficient resource allocation within healthcare institutions.
The proposed approach leverages artificial intelligence (AI) techniques with the aim of enabling, in the future, the transfer of knowledge from high-end medical devices, capable of extracting detailed physiological information, to models trained on data collected from low-cost, off-the-shelf sensors. Rather than focusing on the absolute diagnostic accuracy of medical-grade equipment, this study emphasizes the evaluation of general indicators of patient well-being.
To validate the proposed methodology, we conducted a real-world case study on apnea detection using BCG signals, training a deep learning algorithm on automatically labeled data. Preliminary experiments were performed on the dataset introduced in [1], extending the methodology with automatic data labeling and proposing novel deep learning architectures. Automatic labeling strategies were evaluated and compared against manual annotations to assess the generalization capability and robustness of the proposed framework.
Section 2 discusses related studies, while Section 3 outlines the overall methodology. The open dataset and the BCG labeling process are described in Section 4. Section 5 presents the technique for the automatic detection of apnea events from BCG signals. Data windowing and the neural network architecture are described in Section 6. Experimental results are discussed in Section 7. Section 8 analyzes the performance. Finally, conclusions are drawn in Section 9.
2. Related Work
A good night’s sleep is essential for maintaining psychophysical well-being. Lack of sleep or poor-quality sleep can contribute to the development of chronic conditions, including diabetes, cardiovascular disorders, kidney problems, anxiety, and depression [
2]. The technological landscape of sleep monitoring encompasses a wide range of approaches and devices. Yin et al. [
3] presented a comprehensive overview of long-term sleep monitoring technologies, categorizing physiological signals into three main types: bioelectrical, biomechanical, and biochemical.
PSG [
4] remains the definitive gold standard for the clinical diagnosis of sleep disorders. It involves the use of a traditional device called a polysomnograph, which allows for the simultaneous monitoring of various physiological parameters, including brain activity (EEG), eye movements (EOG), muscle tone (EMG), heart rate (ECG), respiration, oxygen saturation, and body position. This device provides a comprehensive assessment of the patient’s neurological and cardiovascular status. Despite its diagnostic accuracy, PSG has notable limitations when used over extended periods or in home environments. The need to apply multiple electrodes can be uncomfortable for patients, which may adversely impact sleep quality and make home adoption challenging. These drawbacks have driven research toward alternative, less invasive solutions that are more suitable for daily and long-term monitoring.
In recent years, numerous studies have explored alternative technologies for sleep monitoring that do not require the use of PSG, proposing systems based on wearable sensors, piezoelectric sensors [
5], multimodal sensors [
6], and pulsed ultra-wideband (IR-UWB) radar technology [
7]. In this context, ref. [
8] introduced a contactless bed sensor for sleep apnea detection, providing a comparative study that validates its effectiveness against standard methods. The BCG is a non-invasive diagnostic technique that measures low-frequency oscillations generated by cardiac activity, enabling the analysis of events related to the cardiac cycle, such as the duration and intensity of heartbeats. This method is based on Newtonian mechanics and captures the mechanical forces produced by blood flow dynamics and myocardial contractions. Modern BCG has proven especially valuable for the early detection of cardiac anomalies, often identifying functional alterations even before obvious clinical symptoms appear. The technique relies on the analysis of characteristic BCG waves (G, H, I, J, K), each corresponding to specific events in the cardiac cycle, enabling precise differentiation between systolic and diastolic phases. This temporal correlation makes BCG a particularly useful tool for studying cardiovascular dynamics.
The article [
9] reviews several technological solutions for acquiring the BCG signal. Several studies have investigated the potential of BCG as a non-invasive alternative to electrocardiography (ECG), demonstrating significant potential. One of the most notable studies in this area [
10] compared heart rate measurements derived from the J wave of the BCG signal with those obtained from ECG, both recorded simultaneously using a BIOPAC system. Signal processing of the BCG data, performed using wavelet transform techniques, achieved an accuracy of 93%, demonstrating that BCG can effectively rival ECG for heart rate monitoring. At a more advanced level, Morokuma et al. [
11] developed a deep-learning-based system capable of reconstructing ECG signals from BCG data. Leveraging a bidirectional LSTM neural network, the model estimated R–R intervals with an average error of only 0.034 s, paving the way for long-term, electrode-free cardiac monitoring, particularly useful during sleep. There are several case studies dealing with sleep monitoring using BCG, with particular attention to situations where PSG is too invasive or impractical. The study by [
12] focused on children with severe autism, proposing an instrumental system installed in the bed capable of continuously monitoring physiological parameters during the night.
The main goal is to provide an objective and non-invasive assessment of sleep quality, minimizing discomfort in patients with complex neurological disabilities, for whom traditional devices such as ECG or PSG are not compatible with their behavioral needs. In the work of [
13], an innovative approach for the detection of heart failure (HF) is presented based on the integration of BCG and respiratory signals analyzed through machine learning techniques. This system, designed for home monitoring, allows an early and non-invasive diagnosis, representing an economical and practical alternative to hospital investigations. Another case study is presented in [
14], where a real-time cardiac monitoring system is proposed, capable of detecting abnormal heartbeats and generating automatic alerts.
The study presented in [
1] proposes an automatic system for detecting breathing disorders during sleep, based on the analysis of BCG signals using a convolutional neural network (CNN). The approach is entirely non-invasive, thanks to the use of BCG sensors placed under the mattress, and it is independent of the sleeper’s body position. The paper begins by presenting the classical method for identifying the R-peak in the ECG signal, a crucial step for the synchronization and analysis of cardiac signals. It then introduces a formulation based on Cartan curvatures, which is used to describe and model the geometric characteristics of the BCG signal. This representation provides meaningful features for training the model and for the automatic detection of abnormal breathing events.
Extending and refining the methodology defined in [1], we introduce the automatic identification and labeling of abnormal breathing events in the high-resolution BCG signal. This overcomes a limitation of the original approach at large scale, namely the excessive effort required to manually annotate the ECG trace. Moreover, the integrated use of automatic unsupervised training and federated learning techniques allows for continuous improvement of the model. Finally, the deep learning model differs from the one proposed in [1]: it does not take an image transformation of the signals as input, but works directly on the original values.
In [
15], the use of low-cost off-the-shelf devices for training AI models on BCG data was investigated. In [
16], the effectiveness of federated learning, compared with a centralized approach, was evaluated for detecting breathing anomalies during sleep on ECG signals.
This paper focuses on assessing the effectiveness of our AI model applied to BCG signals, evaluating its performance using a centralized approach on a public dataset. The ultimate aim is to enable the replacement of high-end medical devices with low-cost devices.
3. Methodology
PSG currently represents the gold standard for the recognition and annotation of apnea events during sleep. In this work, we aim to simplify the monitoring process both in the hospital setting, by detecting apnea events using only the ECG signal while simultaneously recording the BCG signal, and, more significantly, in the home environment, where monitoring is performed exclusively using BCG. This paradigm is equivalent to developing a robust model within a highly equipped laboratory (the hospital) and subsequently deploying it for everyday use in real-world environments (at home), ensuring that the knowledge acquired under clinical conditions remains effective and generalizable. The following section describes the complete methodological workflow, as illustrated in
Figure 1.
The process begins within the hospital environment, where sleep-related physiological signals are recorded using high-precision medical devices, specifically an ECG. The ECG provides a clinically validated reference for detecting apnea events and serves as the gold-standard supervisory signal for subsequent analyses. Simultaneously, patients are equipped with low-cost devices, such as a BCG belt, capable of capturing indirect signals associated with cardiac and respiratory activity. Through synchronized monitoring, the data collected by the low-cost device can be temporally aligned with the ECG recordings (the ECG being one of the signals recorded in PSG) and with the corresponding clinical annotations. This alignment enables the automatic labeling of low-cost BCG signals, using the ECG as a supervisory reference to generate pseudo-labels, and thus allows the weakly supervised training of non-invasive and easily accessible monitoring systems. Subsequently, an artificial intelligence model is trained via supervised learning to recognize apnea-relevant patterns, using clinically validated annotations from the high-precision medical device as the gold standard.
4. BCG Analysis and Labeling
Obstructive sleep apnea (OSA) [17] is a disorder characterized by repeated interruptions in breathing during sleep. Traditional diagnosis, based on PSG, is accurate but expensive and poorly suited for home monitoring. This motivates the investigation of transferring knowledge from ECG signals (widely recognized for their effectiveness in detecting apnea-related cardiac alterations) to BCG signals, which are less informative but more practical for home-monitoring applications. The objective is to develop a predictive model based solely on BCG data. This approach aims to enable the replacement of ECG with BCG for home-based apnea detection while maintaining a high level of diagnostic accuracy, even in the absence of direct cardiac electrical measurements. It is important to note that the dataset used in this study was collected during controlled breath-holding sessions rather than overnight PSG-based OSA assessments.
4.1. Public Dataset
The dataset used in this study [
18] was published in 2020 and made available via the Mendeley Data Repository (
https://data.mendeley.com/datasets/9fmfn6kfn7/1 (accessed on 25 March 2026)). It was developed as part of a research project of the Faculty of Informatics and Management of the University of Hradec Králové, as well as the PERSONMED project - Center for the Development of Personalized Medicine in Age-Related Diseases.
The dataset is particularly suitable for our case study for the following reasons.
Synchronized ECG and BCG acquisition: The dataset contains simultaneous recordings of 12 BCG signals (acquired via a bedside platform with strain gauge sensors) and a reference ECG signal, sampled at 1 kHz with a 24-bit converter. This synchronization is essential for studying correlations between the two signals.
Presence of controlled breath-holding events: Participants performed voluntary breath-holding sessions, each lasting approximately 30 s. These events were manually annotated by experts, providing reliable time labels for training and validating classification models.
Each measurement follows one of two experimental protocols, V1 or V2, which correspond to different acquisition procedures. Table 1 provides a detailed description of the events performed by the volunteers, including the exact duration (in seconds) of each event.
Well-organized data structure: Each measurement is represented as a temporal matrix with 14 columns—a validity flag, the 12 BCG signals, and the ECG signal. This format facilitates processing and temporal alignment between signals.
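As an illustration of this layout, the following minimal Python sketch splits one 14-column row into its three parts. The column order follows the description above; the function name `split_row`, the in-memory list representation, and the sample values are assumptions made for illustration and are not part of the dataset's own tooling.

```python
def split_row(row):
    """Split one 14-column sample into (valid, bcg, ecg).

    Assumed column order, following the dataset description:
    column 0 = validity flag, columns 1-12 = BCG channels, column 13 = ECG.
    """
    if len(row) != 14:
        raise ValueError("expected 14 columns, got %d" % len(row))
    valid = bool(row[0])
    bcg = list(row[1:13])   # the 12 BCG channels
    ecg = row[13]           # reference ECG sample
    return valid, bcg, ecg


# Hypothetical sample: flag = 1, BCG channels 0.1 .. 1.2, ECG = -0.5
sample = [1] + [round(0.1 * c, 1) for c in range(1, 13)] + [-0.5]
valid, bcg, ecg = split_row(sample)
```

Keeping the validity flag separate makes it straightforward to discard invalid samples before windowing.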
4.2. Labeling Procedure
The labeling procedure was carefully designed to produce high-quality annotations for training a CNN to analyze multi-channel BCG signals available in the selected dataset. A moving window was used to extract training samples from the original data. Each sample corresponds to 30 s (30,000 points) of the recorded signals, including 12 synchronized BCG signals recorded simultaneously at a sampling rate of 1 kHz from various sensor positions or orientations, as detailed in [
18]. No preliminary signal processing was executed to remove noise or artifacts before labeling, ensuring that annotations reflect the raw characteristics of the signals. For each sample, the target label is computed by processing all twelve BCG synchronized signals in the same window.
In Figure 2, the top panel depicts one of the twelve BCG signals from the original dataset. A segment of the BCG signal, ranging from sample 300,000 to 480,000, was replaced with zeros to mitigate the influence of significant motion artifacts caused by subject movement during recording. This does not affect the feasibility of the approach, since such an interval can be easily identified by the BCG device. The zeroing procedure was applied only to the signals belonging to the V1 subset of the dataset, and the zeroed segment therefore remains unlabeled. Two overlapping instances of the moving window are depicted with solid and dashed red lines.
The bottom panel of Figure 2 illustrates the User-Defined Function (UDF) employed in the labeling process (blue line), as defined in Equation (1). The location parameter of the UDF is set to align each UDF with the right boundary of the apnea intervals defined in the original study, while its scale factor is calculated to normalize the area under the curve to unity.
Moreover, the UDF is constructed such that a 30 s sliding window, centered on the peaks of the UDF itself, yields an integral value close to one. The synchronization between the BCG and the UDF adheres to the annotation protocol described in [1], which maps specific ECG segments to apnea-related events. The integral of the UDF over a window, i.e., the area under the curve, is used as a measure of the probability of finding an apnea event in that window. Taking into account that the American Academy of Sleep Medicine (AASM) [19] defines apnea as a complete or near-complete (≥90%) cessation of airflow lasting at least 10 s in adults, the UDF starts to grow after 10 s of breathing pause and increases until the subject resumes breathing. In fact, the maximum impact in terms of ECG anomaly is expected to be observed right before and after the end of the apnea interval.
Labeling was performed by integrating the UDF over the sliding windows aligned with the BCG signal. Each resulting integration value was then used to label the corresponding 30 s BCG segment (30,000 points) across all 12 channels, based on the range, or bin, in which the integration value fell.
For example, consider a binary classification problem with two possible labels, 0 and 1. In this case, the integration result of the UDF can fall into one of two intervals: [0, 0.5) or [0.5, 1]. Each sample is assigned a label according to the interval in which its UDF integration value lies. This method can be readily extended from binary to multiclass labeling of BCG data by using more than two intervals.
Figure 3 provides a detailed view of the labeling process illustrated in
Figure 2. Each labeled window corresponds to a fixed 30 s segment of BCG signals. By employing a sliding-window approach with adjustable stride parameters, the number of labeled training samples can be increased. This procedure produces a well-structured set of input–output pairs suitable for supervised learning, allowing the CNN to effectively capture and model the spatio-temporal features inherent in multi-channel BCG signals.
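The labeling step can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the exact UDF of Equation (1) is not reproduced here, so a hypothetical ramp is used instead (zero until 10 s after apnea onset, rising linearly up to the end of the apnea, scaled to unit area). The function names, the 0.5 threshold for the binary case, and the example interval are likewise illustrative choices.

```python
def udf_area(a, b, onset, end, delay=10.0):
    """Area of a hypothetical ramp-shaped UDF inside the window [a, b].

    The ramp is zero until `delay` seconds after apnea onset, rises linearly
    up to the apnea end, and is scaled so that its total area equals one.
    """
    start = onset + delay
    width = end - start
    if width <= 0:
        return 0.0  # pause shorter than the delay: no event

    def cumulative(t):
        # Closed-form integral of the unit-area ramp from `start` to t
        clipped = min(max(t, start), end)
        return ((clipped - start) / width) ** 2

    return cumulative(b) - cumulative(a)


def label_windows(duration, apneas, window=30.0, stride=5.0, threshold=0.5):
    """Return (window_start, label) pairs for all full windows in the recording."""
    labels = []
    k = 0
    while k * stride + window <= duration:
        a = k * stride
        area = sum(udf_area(a, a + window, on, off) for on, off in apneas)
        labels.append((a, 1 if area >= threshold else 0))
        k += 1
    return labels


# Hypothetical 120 s recording with one breath-hold from 40 s to 70 s
labels = label_windows(120.0, [(40.0, 70.0)])
positive_starts = [start for start, label in labels if label == 1]
```

Reducing the stride increases the number of windows, and therefore of labeled samples, without changing the labeling rule itself.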
5. Automatic Apnea Detection
To enable unsupervised learning, the system must be able to automatically detect apnea episodes from BCG signals. To this end, we first conducted a preliminary analysis on ECG signals (the most representative of cardiac activity), which serve as a reference. ECG analysis allows us to visually determine the correct placement of the maximum of the unsupervised discriminant function, which is subsequently used to generate automatic labels for training the network on BCG data.
During apnea, the temporary interruption of respiration triggers distinct autonomic and cardiac perturbations. To quantify these changes, we introduce three ECG-derived metrics capturing such physiological variations.
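Heart Rate Variation (DHR) quantifies instantaneous changes in heart rate using RR intervals. A standard metric is the standard deviation of RR intervals (SDNN):
$$\mathrm{SDNN} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(RR_i - \overline{RR}\right)^2}$$
where $RR_i$ is the $i$-th RR interval, $\overline{RR}$ its mean value, and $N$ the number of intervals. Heart-rate variation can also be expressed as:
$$\mathrm{DHR}_i = \left|HR_{i+1} - HR_i\right|, \qquad HR_i = \frac{60}{RR_i}$$
A decrease in heart rate variability (HRV) and DHR reflects autonomic imbalance and is often associated with apnea [20,21].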
LF/HF ratio quantifies autonomic balance through the power spectral density of RR intervals:
$$\mathrm{LF/HF} = \frac{P_{LF}}{P_{HF}}$$
where $P_{LF}$ and $P_{HF}$ are the spectral powers of the low-frequency (0.04–0.15 Hz) and high-frequency (0.15–0.4 Hz) bands, respectively. An elevated LF/HF ratio serves as a marker for sympathetic predominance, typically observed during apnea [20,21].
ECG-Derived Respiration (EDR) is a respiratory surrogate extracted from the ECG, often from R-peak amplitude modulation:
$$\mathrm{EDR}[n] = A_R[n]$$
where $A_R[n]$ is the amplitude of the $n$-th R-peak. An alternative respiration-related metric is:
EDR reflects respiration-dependent cardiac modulation and is highly sensitive to apnea-related reductions in breathing effort [
20].
Table 2 presents the main ECG-derived metrics and their expected changes during apnea events.
In the original dataset, each recording contains an ECG signal sampled at 1 kHz without any preprocessing or artifact removal. To compute the introduced parameters, the first key pre-processing step is to identify the R-peaks of the ECG signal, from which the RR intervals are obtained.
To compute the DHR parameter, the ECG signal is filtered between 5 and 15 Hz to enhance R-peaks, which are detected using amplitude analysis. RR intervals are computed and converted into heart rate values (bpm). DHR is calculated as the absolute value of the difference between consecutive heart rate samples, and can be smoothed to reduce noise.
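The chain from R-peak times to DHR can be illustrated with a short Python sketch. It assumes that R-peak times (in seconds) have already been detected. The function names and the toy peak times are illustrative, and the SDNN helper uses the sample standard deviation (division by N − 1), which is one common convention.

```python
import statistics


def rr_intervals(r_peak_times):
    """RR intervals (s) from consecutive R-peak times (s)."""
    return [b - a for a, b in zip(r_peak_times, r_peak_times[1:])]


def heart_rate(rr):
    """Instantaneous heart rate (bpm) for each RR interval."""
    return [60.0 / x for x in rr]


def dhr(hr):
    """Absolute difference between consecutive heart-rate samples (bpm)."""
    return [abs(b - a) for a, b in zip(hr, hr[1:])]


def sdnn(rr):
    """Sample standard deviation of the RR intervals (s)."""
    return statistics.stdev(rr)


# Toy example: three beats one second apart, then a shorter interval
peaks = [0.0, 1.0, 2.0, 2.5]
rr = rr_intervals(peaks)   # [1.0, 1.0, 0.5]
hr = heart_rate(rr)        # [60.0, 60.0, 120.0]
delta = dhr(hr)            # [0.0, 60.0]
```

The optional smoothing mentioned above would be applied to `delta` afterwards, for example with a moving average.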
To compute the LF/HF parameter, the heart rate time series is interpolated at 4 Hz to obtain uniformly spaced samples. A moving window of 60 s with a 1 s step length extracts segments for power spectral density estimation via Welch’s method. The power is integrated over the LF (0.04–0.15 Hz) and HF (0.15–0.4 Hz) bands, and their ratio (LF/HF) is computed.
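The band-power ratio can be illustrated on a single 60 s segment with the following Python sketch. To keep it dependency-free, it uses a plain single-segment periodogram computed with a direct DFT, which is a simplified stand-in for Welch's method rather than the procedure used in this study. The function names, the half-open band edges, and the synthetic test signal are assumptions made for illustration.

```python
import cmath
import math


def band_power(x, fs, f_lo, f_hi):
    """Sum of periodogram power of x in the band f_lo <= f < f_hi (Hz)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]          # remove the DC component
    power = 0.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n                        # frequency of DFT bin k
        if f_lo <= f < f_hi:
            coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                        for i, v in enumerate(centered))
            power += abs(coeff) ** 2
    return power


def lf_hf_ratio(hr_uniform, fs=4.0):
    """LF/HF ratio of a uniformly resampled heart-rate segment."""
    lf = band_power(hr_uniform, fs, 0.04, 0.15)
    hf = band_power(hr_uniform, fs, 0.15, 0.40)
    return lf / hf


# Synthetic 60 s segment at 4 Hz: a 0.1 Hz component of amplitude 2 (LF band)
# and a 0.25 Hz component of amplitude 1 (HF band), so the power ratio is 4.
fs = 4.0
segment = [70.0
           + 2.0 * math.sin(2 * math.pi * 0.10 * i / fs)
           + 1.0 * math.sin(2 * math.pi * 0.25 * i / fs)
           for i in range(240)]
ratio = lf_hf_ratio(segment, fs)
```

In practice, an averaged estimator such as Welch's method reduces the variance of the spectral estimate on noisy physiological data.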
Finally, to compute the EDR signal, R-peak amplitudes are extracted from the filtered ECG to form the raw EDR signal, which is then interpolated to create a continuous respiratory surrogate. Respiratory pauses are detected as intervals in which the variability of this signal stays low for at least a minimum pause duration.
Figure 4 shows the values of the introduced parameters computed for an ECG trace extracted from the dataset. During each apnea interval, there is typically an increasing trend in all computed parameters, with the maximum values varying according to the individual physiological response of the subject. For example, the LF/HF signal shows that the impact of apnea intensifies with each successive event, as the subject is unable to fully recover to the baseline state. Moreover, the peak effect of the phenomenon does not coincide exactly with the end (right border) of the apnea interval. In Figure 4, LF/HF and EDR peaks provide the most reliable diagnostic indicators, although this may vary between traces. The y-axis of the EDR signal represents ADC counts, corresponding to a bipolar representation of the 24-bit ADC output. Artifacts caused by user movement, as well as individual variations in physiological response and reactivity, may also influence signal recordings and introduce errors during testing.
5.1. Performance of Detection
To facilitate apnea event recognition, each parameter was first smoothed with a 5 s moving-average filter. Peaks and sustained elevations were then detected on the smoothed signals using feature-specific thresholds. Event candidates were evaluated in a running window of length W (30–40 s depending on the feature) slid with a 1 s step across the recording. For peak-based methods (DHR and EDR), a min-peaks criterion required at least N peaks above threshold within the window (e.g., N = 3 in W = 30–40 s). For LF/HF, a sustained-elevation rule enforced a minimum duration of 10 s above threshold. Detected events were matched to ground truth using overlap (for LF/HF) or a ±30 s proximity tolerance (for peak detections). Finally, we quantified temporal displacement as the difference between the time of maximum response and the right border of the matched apnea interval. This multi-parameter consensus successfully attenuated the false-positive rate compared with single-indicator detection, and the resulting detection error must be added to the classification error of the neural model.
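The two detection rules described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in this study: the function names, the merging of overlapping peak clusters, and the toy inputs are assumptions, and the thresholding that produces the supra-threshold peak times is taken as already done.

```python
def min_peaks_events(peak_times, window=30.0, min_peaks=3):
    """Group supra-threshold peaks into events.

    An event is reported wherever at least `min_peaks` peaks fall within
    `window` seconds; overlapping clusters are merged into one event.
    """
    times = sorted(peak_times)
    events = []
    for i in range(len(times) - min_peaks + 1):
        j = i + min_peaks - 1
        if times[j] - times[i] <= window:
            start, end = times[i], times[j]
            if events and start <= events[-1][1]:
                events[-1] = (events[-1][0], max(events[-1][1], end))
            else:
                events.append((start, end))
    return events


def sustained_events(values, threshold, min_duration=10.0, step=1.0):
    """Return (start, end) times of runs above `threshold` lasting >= min_duration."""
    events, start = [], None
    # A trailing sentinel closes a run that reaches the end of the signal
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            if (i - start) * step >= min_duration:
                events.append((start * step, i * step))
            start = None
    return events


# Toy DHR/EDR peaks (s): two bursts and one isolated peak at 100 s
peak_events = min_peaks_events([10, 15, 22, 100, 200, 205, 212, 220])

# Toy LF/HF series sampled every 1 s: a 12 s elevation and a 4 s elevation
lfhf = [0] * 5 + [2] * 12 + [0] * 5 + [2] * 4 + [0] * 3
elevation_events = sustained_events(lfhf, threshold=1)
```

The isolated peak at 100 s and the 4 s elevation are both rejected, which is the intended effect of the two criteria.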
5.2. Event Detection Performance
To identify optimal detection parameters for apnea event recognition, we performed a comprehensive threshold optimization analysis across ECG-derived physiological indicators using labeled apnea intervals. Three parameters were evaluated: delta heart rate (DHR), the low-frequency to high-frequency power ratio (LF/HF), and ECG-derived respiration (EDR). For each parameter, we systematically searched the parameter space to maximize event-based F1 score, where events were matched to ground truth intervals using temporal overlap for sustained elevations or proximity tolerance (±30 s) for peak detections.
Two complementary strategies were evaluated: (i) peak-based detection on DHR and EDR using a global amplitude threshold to mark transient HR accelerations, and to capture respiration-related envelopes, and (ii) sustained-elevation detection on LF/HF using a threshold plus a minimum duration to model gradual autonomic activation during apneic episodes. To reduce false positives in peak-based methods, a “min-peaks in window” criterion was added, requiring at least N peaks above threshold within W seconds to promote temporally coherent bursts over isolated excursions that often reflect noise or single ectopic beats.
Temporal displacement was quantified for each matched event as the difference between the event’s time of maximum response and the right (end) border of its corresponding ground-truth interval, summarizing the systematic anticipation or lag of autonomic/respiratory markers relative to episode termination.
For each parameter and method, grid searches were conducted across threshold ranges derived from feature percentiles and, where applicable, across minimum duration (LF/HF) or min-peaks/window hyperparameters (DHR/EDR), selecting configurations that maximized event-level F1 on the combined dataset.
Table 3 includes aggregate performance, reporting precision/recall trade-offs, enabling fair cross-feature comparison and multi-parameter fusion design for practical ECG-based apnea screening.
The LF/HF ratio and DHR methods demonstrated superior performance. The sustained-elevation approach for LF/HF proved particularly effective as it captures the gradual sympathovagal imbalance characteristic of apnea episodes, whereas instantaneous peak-based metrics are more susceptible to transient physiological fluctuations unrelated to respiratory events. On the other hand, requiring multiple DHR peaks increases precision and provides a good balance between precision and recall. DHR with the minPeaks strategy reaches 84% precision and 80% recall (F1 = 0.82), meaning that only 16% of the detected events are false alarms, which supports the practical viability of the approach. The temporal displacement analysis revealed that the detected event occurs on average 17 s before the end of the apnea interval, with a standard deviation of 8 s. This displacement is computed on ECG-based detections (DHR minPeaks) relative to the right border of the annotated apnea interval. We will use these values to estimate the real accuracy of the BCG-based detection method.
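The event-level scoring can be sketched as below. The matching rule is an assumption made for illustration: a detection counts as a true positive if it lies within the tolerance of a not-yet-matched annotated interval, and the displacement is the detection time minus the right border of that interval. Function names and the toy data are hypothetical. Only the final check on F1 uses values from the text.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_detections(detections, intervals, tolerance=30.0):
    """Match detection times to annotated (start, end) intervals.

    Returns precision, recall, F1, and the displacements (s) of matched
    detections relative to the right border of their interval.
    """
    matched, displacements = set(), []
    for t in detections:
        for idx, (start, end) in enumerate(intervals):
            if idx not in matched and start - tolerance <= t <= end + tolerance:
                matched.add(idx)
                displacements.append(t - end)   # negative = before the end
                break
    tp = len(matched)
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(intervals) if intervals else 0.0
    return precision, recall, f1_score(precision, recall), displacements


# Toy example: two of three detections fall near an annotated interval
precision, recall, f1, disp = score_detections(
    detections=[55.0, 150.0, 400.0],
    intervals=[(40.0, 70.0), (130.0, 160.0), (220.0, 250.0)],
)
mean_displacement = sum(disp) / len(disp)

# Consistency check of the reported figures: P = 0.84, R = 0.80
reported_f1 = round(f1_score(0.84, 0.80), 2)
```

With 84% precision and 80% recall, the harmonic mean rounds to 0.82, in agreement with the value reported above.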
6. BCG Signal-Based Deep Learning Classification
The 12-channel BCG signal, sampled at $f_s$ = 1 kHz, is segmented into fixed-length windows of duration T = 30 s with a stride of S samples. Each window is mapped by a deep network to a posterior probability of apnea, and window labels are derived from the UDF integration described in Section 4.2. The stride S is tuned by validation, and the final model is evaluated at the stride yielding the best validation performance, including an event-level time-displacement error to quantify detection timing relative to ground truth.
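Let $x(t) \in \mathbb{R}^{C}$ be the multichannel signal acquired from the four bed supports, each equipped with a 3-axis sensor measuring the components x, y, and z. Hence, the total number of channels is $C = 4 \times 3 = 12$. The signal is sampled at $f_s$, and its discrete-time representation is $x[n] = x(n/f_s)$, $n = 0, \dots, N-1$.
For a fixed window duration $T$ (corresponding to $L = T f_s$ samples) and stride $S$ (in samples), the $k$-th analysis window is defined as:
$$\mathcal{W}_k = \{kS,\; kS+1,\; \dots,\; kS+L-1\}$$
The corresponding network input for window $k$ is the tensor:
$$X_k = \big[x[n]\big]_{n \in \mathcal{W}_k} \in \mathbb{R}^{C \times L}$$
obtained by stacking the $C = 12$ channels (three per bed foot) over the indices in $\mathcal{W}_k$.
The total number of windows extracted with stride $S$ is given by:
$$K(S) = \left\lfloor \frac{N - L}{S} \right\rfloor + 1$$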
Let the set of apnea intervals be $\mathcal{A} = \{[a_j, b_j]\}_{j=1}^{J}$, expressed in sample indices (or equivalently in seconds by dividing by the sampling frequency $f_s$). A window is labeled as positive, i.e., affected by apnea, if the integral of the UDF over the window falls within the interval [0.5, 1].
A deep network $f_\theta$ maps the input tensor of window $k$, denoted $X_k$, to a posterior probability $p_k = f_\theta(X_k) \in [0, 1]$ of apnea in window $k$.
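To study the effect of temporal sampling density, models are trained for stride values $S \in \{5000,\ 10{,}000,\ 15{,}000\}$ samples (equivalently $S/f_s \in \{5,\ 10,\ 15\}$ s at a sampling frequency $f_s$ of 1 kHz). Let $M(S)$ be a validation metric (e.g., event-level $F_1$ or average precision). The selected stride is
$$S^{*} = \arg\max_{S} M(S)$$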
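The windowing arithmetic can be checked with a short Python sketch. The function names, the 3-minute example recording, and the metric values passed to the stride selection are purely illustrative placeholders, not experimental results.

```python
def window_length(duration_s, fs):
    """Number of samples in a window of `duration_s` seconds at rate `fs` (Hz)."""
    return int(round(duration_s * fs))


def num_windows(n_samples, length, stride):
    """Number of full windows: floor((N - L) / S) + 1, or 0 if N < L."""
    if n_samples < length:
        return 0
    return (n_samples - length) // stride + 1


def window_bounds(k, length, stride):
    """Half-open sample range [start, end) of the k-th window."""
    start = k * stride
    return start, start + length


def select_stride(metric_by_stride):
    """Stride with the highest validation metric (argmax over the candidates)."""
    return max(metric_by_stride, key=metric_by_stride.get)


L = window_length(30, 1000)                      # 30 s at 1 kHz
counts = {S: num_windows(180_000, L, S)          # hypothetical 3 min recording
          for S in (5000, 10_000, 15_000)}

# Placeholder metric values, chosen only to exercise the argmax
best = select_stride({5000: 0.1, 10_000: 0.3, 15_000: 0.2})
```

Tripling the stride from 5000 to 15,000 samples reduces the number of windows in this example from 31 to 11, which shows how strongly the stride affects the training set size.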
Neural Network Architecture
The proposed model architecture, illustrated in
Figure 5, is a lightweight one-dimensional convolutional neural network (1D-CNN) designed for efficient feature extraction and classification from sequential data. The network follows a streamlined hierarchical structure, progressively transforming the input signal into a compact representation through a series of optimized layers.
The architecture begins with an input layer that receives the raw one-dimensional signal. This is followed by three successive convolutional blocks Conv1–Conv3. Each block consists of a 1D convolutional layer with a kernel size of 3, a stride of 2, and padding of 1, followed by batch normalization (BN) and a ReLU activation function. Unlike traditional deep models, this architecture utilizes a minimal number of filters—increasing from 1 to 3 across the blocks—effectively reducing the computational footprint while capturing essential temporal patterns through strided convolutions.
Following the third convolutional block, the feature maps are flattened into a 1D tensor to be processed by a sequence of three fully connected (FC) layers. The first two layers (
FC1 and
FC2 in
Figure 5) comprise 20 and 10 neurons, respectively, both employing ReLU activation functions to introduce non-linearity. This progression refines the high-level features extracted by the convolutional front-end into a low-dimensional discriminative space.
The network generates its final output through the last fully connected layer (
FC3 in
Figure 5), which maps the learned representations to the target number of classes. This concluding stage provides the logits necessary for multi-class classification.
In contrast to high-capacity models, this architecture is specifically engineered to minimize computational overhead. With only 225,330 trainable parameters and a memory footprint of approximately 0.86 MB, the model ensures high efficiency and rapid inference while maintaining the learning capacity required for robust performance.
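The stated parameter count can be reproduced with simple arithmetic. The sketch below assumes a layer configuration that is our reading of the description and is not stated explicitly in the text: 12 input channels, 30,000-sample windows, convolutional blocks with 1, 2, and 3 filters, two learnable batch-normalization parameters per filter, two output classes, and 32-bit floating-point weights. Under these assumptions the total matches the figures reported above.

```python
def conv_out_len(length, kernel=3, stride=2, padding=1):
    """Output length of a 1D convolution (standard floor formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def count_parameters(in_channels=12, length=30000, filters=(1, 2, 3),
                     hidden=(20, 10), n_classes=2, kernel=3):
    """Trainable parameters of the assumed Conv-BN-ReLU x3 + FC x3 network."""
    total = 0
    channels = in_channels
    for f in filters:
        total += channels * f * kernel + f   # convolution weights and biases
        total += 2 * f                       # batch-norm scale and shift
        channels = f
        length = conv_out_len(length, kernel)
    features = channels * length             # size of the flattened tensor
    for units in hidden + (n_classes,):
        total += features * units + units    # fully connected weights and biases
        features = units
    return total, length


n_params, final_length = count_parameters()
size_mb = n_params * 4 / 2 ** 20             # float32 weights, in MiB
```

Almost all parameters sit in the first fully connected layer (11,250 × 20 weights), so the convolutional front-end contributes only a few dozen parameters.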
7. Experimental Results
The dataset was derived from the original BCG signals collected under controlled breath-holding protocols. A sliding window with a specific stride between consecutive windows was applied to the signals. To evaluate the impact of window overlap, the stride was varied from 5 s to 15 s. After generating the windowed samples, the data were divided into training and test sets using an 80–20% split. Training was conducted for 15 epochs, at which point the model reached convergence. Furthermore, a group split was implemented to ensure that data from any given patient appeared exclusively in either the training or the test set.
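A subject-wise (group) split of this kind can be sketched as follows. The function name, the fixed seed, and the toy subject identifiers are assumptions made for illustration. With few subjects, the resulting sample-level proportions only approximate the nominal 80–20% split.

```python
import random


def group_split(subject_ids, test_fraction=0.2, seed=0):
    """Split window indices so that each subject falls entirely in one set.

    `subject_ids[i]` is the subject that window i was recorded from.
    Returns (train_indices, test_indices).
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


# Toy example: 10 hypothetical subjects with 10 windows each
ids = ["s%d" % (i % 10) for i in range(100)]
train_idx, test_idx = group_split(ids)
train_subjects = {ids[i] for i in train_idx}
test_subjects = {ids[i] for i in test_idx}
```

Splitting by subject rather than by window prevents overlapping windows from the same recording from appearing on both sides of the split.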
Three stride values were considered during the labeling stage to investigate their impact on model performance.
Figure 6 shows the training behavior when the maximum of the UDF function is placed at the right border of each apnea window. The accuracy curves enable a comparative assessment of how stride influences classification performance.
The results show that with the smallest stride, the model does not effectively learn the temporal patterns, whereas increasing the stride up to 15 s allows the model to achieve stable performance, reaching its best accuracy after 7 epochs.
In the inference phase, the model was tested on the third portion of the dataset. We note that the stride value influences the composition of both the training and test datasets, affecting the number of samples as well as the overall difficulty of the task. To properly evaluate the effectiveness of the training process on the same model, it was therefore necessary to establish a common reference baseline. We addressed this by comparing the performance of models trained on three different training datasets against a challenging benchmark task, i.e., performing inference on the test dataset generated with a stride of 5000. This strategy provides a uniform basis for comparison, enabling the evaluation to focus on the models’ generalization capability independently of the original training sample size.
Table 4 and Table 5 summarize the performance results obtained during the inference phase, where a comprehensive evaluation was carried out across the model instances trained with the different stride values.
Figure 6 shows that the best performance is achieved by the model trained and tested on the 5000-stride dataset; as the stride increases, performance progressively deteriorates. This trend is consistent with the results reported in Table 4, where the highest accuracy and lowest loss are obtained when both training and testing are performed with 15,000-stride windows. Furthermore, when training and inference are performed on different datasets, the model trained with 15,000-stride windows generally achieves superior performance compared to the 10,000- and 5000-stride configurations, while the 5000-stride model consistently shows the lowest accuracy and the highest loss values.
This suggests that the 5000-stride model successfully classifies the specific sample types encountered during its training. In contrast, models trained with larger strides exhibit lower accuracy on their own sets because these wider intervals exclude samples that partially overlap with apnea events. Consequently, when the 10,000- and 15,000-stride models are tested on the 5000-stride dataset, they encounter sample variations, specifically those partial overlaps, that were absent during their training phase.
Once we identified an optimal stride size, we aimed to evaluate the impact of imperfect detection of apnea intervals from the ECG using our analytical method. In practice, this imperfection introduces both mislabeling errors and time displacements, as detected apnea intervals do not perfectly overlap with the gold-standard annotations.
To simulate this effect, we applied a random time-shift to each apnea interval in the gold-standard annotations. Specifically, for each interval, a random offset was added, bounded in magnitude by a maximum allowed displacement. We then computed the mislabeling rate using the UDF function under these time-shifted conditions. This procedure was repeated for different values of the maximum displacement to study the relationship between displacement error and mislabeling rate.
Finally, to assess the impact of such displacement errors on model training, we trained the network multiple times using datasets corresponding to different mislabeling rates induced by these random offsets. This approach allowed us to quantify how sensitivity to interval misalignment affects training performance.
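The displacement experiment can be illustrated with the following sketch. It rests on several assumptions that are not taken from the study: the offset is drawn uniformly and symmetrically around zero, the whole interval is shifted rigidly, and a simple overlap threshold (`min_overlap_s`, hypothetical) stands in for the UDF-based labeling. All function names and the example annotations are illustrative.

```python
import random


def shift_intervals(intervals, max_shift_s, rng):
    """Displace each (start, end) interval by one offset drawn uniformly from [-max_shift_s, +max_shift_s]."""
    shifted = []
    for start, end in intervals:
        offset = rng.uniform(-max_shift_s, max_shift_s)
        shifted.append((start + offset, end + offset))
    return shifted


def label_windows(intervals, n_windows, window_s=30.0, stride_s=10.0, min_overlap_s=10.0):
    """Label a window 1 if it overlaps some apnea interval for at least min_overlap_s seconds."""
    labels = []
    for k in range(n_windows):
        w_start, w_end = k * stride_s, k * stride_s + window_s
        overlap = max((min(w_end, b) - max(w_start, a) for a, b in intervals), default=0.0)
        labels.append(int(overlap >= min_overlap_s))
    return labels


def mislabeling_rate(intervals, max_shift_s, n_windows, seed=0):
    """Fraction of windows whose label changes after the random displacement."""
    rng = random.Random(seed)
    clean = label_windows(intervals, n_windows)
    noisy = label_windows(shift_intervals(intervals, max_shift_s, rng), n_windows)
    return sum(c != n for c, n in zip(clean, noisy)) / n_windows


apneas = [(40.0, 60.0), (150.0, 175.0)]  # hypothetical annotations, in seconds
for max_shift_s in (0.0, 2.0, 5.0, 10.0):
    print(max_shift_s, mislabeling_rate(apneas, max_shift_s, n_windows=30))
```

Sweeping the maximum displacement and recording the resulting rate gives the displacement-to-mislabeling relationship; each rate can then be used to build a training set with a known level of label noise.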
Figure 7 shows the training accuracy and loss for different percentages of mislabeled time windows, using a stride of 10 s. As explained above, the mislabeling is generated by injecting random time displacements into the apnea intervals; in practice, such displacements are introduced by detection errors in the automatic labeling.
Figure 8 shows how the test accuracy decreases as the mislabeling rate increases. The average and deviation are computed over 10 runs for each value of the mislabeling rate.
Finally, Figure 9a shows the time required to preprocess patient files into windowed samples and calculate labels via UDF integrals, plotted against file size. Complementing this, Figure 9b depicts how training duration scales with the size of the training dataset.
8. Discussion and SOTA Comparison
The combination of the presented methods allows for the development of a weakly-supervised learning framework that employs a two-stage cascaded architecture, where ECG-based apnea detection serves as pseudo-ground truth for training a deep learning model on BCG data. The experimental results are here used to evaluate the end-to-end accuracy estimation under imperfect labeling supported by such a framework.
We note that the dataset used in the experiments is based on controlled breath-holding protocols rather than full-night, PSG-confirmed OSA cohorts, which limits the generalization of the quantitative results to clinical settings.
The ECG-based DHR + minpeaks method achieved an F1 score of (precision = 0.842, recall = ) when evaluated against clinical ground truth annotations.
Assuming an event prevalence of
, the expected confusion matrix components can be derived as follows:
This corresponds to an estimated true ECG accuracy of
A secondary neural classifier, trained on a synchronized signal using the detections from the first model as training labels, achieved an apparent accuracy of
with respect to these pseudo-ground-truth annotations. Considering that the reference classifier is not perfectly accurate, the propagated true accuracy of the neural classifier can be estimated as:
where
represents the probability that the labeling is correct.
Substituting the values gives an estimated true accuracy of approximately 77.7%.
This analysis illustrates that, although the neural model shows a relatively high apparent accuracy when compared to the reference labels, its true performance with respect to real apnea events is lower. The recorded performance primarily characterizes consistency with the training annotations rather than a genuine improvement in event detection accuracy. To avoid ambiguity, Table 6 provides a concise nomenclature of all accuracy metrics reported in the manuscript.
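The two steps of this analysis (deriving the reference accuracy from precision, recall, and prevalence, and then propagating it to the second classifier) can be sketched as follows. This is one common way to model the propagation, not necessarily the exact expression used in the study: it assumes a binary task and that the errors of the neural classifier with respect to the pseudo-labels are independent of the labeling errors, so that a prediction is truly correct either when it agrees with a correct label or when it disagrees with a wrong one. The function names are illustrative, and all numerical inputs except the precision of 0.842 are hypothetical placeholders.

```python
def confusion_from_metrics(precision, recall, prevalence):
    """Expected confusion-matrix fractions (TP, FP, FN, TN), each relative to all windows."""
    tp = recall * prevalence
    fn = prevalence - tp
    fp = tp * (1.0 - precision) / precision
    tn = 1.0 - prevalence - fp
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified windows."""
    return (tp + tn) / (tp + fp + fn + tn)


def propagated_accuracy(apparent_acc, label_acc):
    """True accuracy under independent binary label noise (assumed model).

    Correct = agrees with a correct label, or disagrees with a wrong label.
    """
    return apparent_acc * label_acc + (1.0 - apparent_acc) * (1.0 - label_acc)


# Precision is the value reported above; recall, prevalence, and apparent
# accuracy are hypothetical placeholders, not results of the study.
teacher_acc = accuracy(*confusion_from_metrics(precision=0.842, recall=0.80, prevalence=0.30))
student_true = propagated_accuracy(apparent_acc=0.85, label_acc=teacher_acc)
print(f"teacher accuracy: {teacher_acc:.3f}, propagated student accuracy: {student_true:.3f}")
```

Under this model the propagated accuracy never exceeds the apparent accuracy as long as both the classifier and the labels are better than chance, which is consistent with the qualitative conclusion drawn above.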
Table 7 compares our end-to-end accuracy with the values (up to 98%) reported in previous works such as [8, 18]. The comparison must account for substantial differences in experimental design and signal processing, since those works address idealized classification tasks whereas our framework targets continuous monitoring.
As shown in Table 7, a direct comparison of accuracy values requires considering the evaluation protocol. State-of-the-art methods like [18] report 98% accuracy on pre-segmented event datasets, which effectively removes the challenge of detecting event boundaries and transitional artifacts. That figure is therefore not directly comparable with the continuous sliding-window evaluation adopted in this work.
In contrast, our framework operates in a continuous sliding-window regime, which is essential for real-time alerts. Our ECG-based “Teacher” achieves 96% consistency with PSG, with a specificity (TN rate) of 88%, confirming the reliability of our training labels. However, when moving to the fully automated BCG “Student” model, the estimated true accuracy is 77.7%. While numerically lower than [18], this aligns with or exceeds other continuous monitoring baselines (e.g., [8] reported ≈50% accuracy for minute-by-minute bed sensor detection), reflecting the realistic trade-off between automation and precision in home settings.
9. Conclusions
This work demonstrates the feasibility of a centralized pipeline that leverages ECG-derived, automatically detected apnea events to pseudo-label multi-channel BCG and train a deep 1D-CNN for home-oriented monitoring, thereby bridging high-end clinical instrumentation and low-cost, unobtrusive devices. The approach integrates a principled UDF-based labeling strategy aligned with temporal criteria, multi-parameter ECG analytics (DHR, LF/HF, EDR) for robust event detection, and stride-aware BCG windowed classification over 12 channels sampled at 100 Hz with 30 s windows.
Methodologically, the study introduces an end-to-end pipeline that can scale beyond manual annotation by aligning PSG-supervised ECG events with synchronized low-cost BCG and converting them into probabilistic labels via a UDF integral consistent with apnea duration semantics. The analysis highlights the importance of temporal alignment and stride selection for balancing sample diversity and convergence, and shows that multi-parameter ECG consensus (notably, sustained LF/HF and peak-coherent DHR) improves robustness over single-indicator detection.
Propagating reference-label imperfection yields an estimated true end-to-end accuracy of approximately 77.7%, clarifying that apparent gains primarily reflect consistency with pseudo-labels. The resulting model could nevertheless be used for unobtrusive, continuous monitoring of the user’s well-being at home, complementing clinical information.
The principal limitations are the small-scale optimization on a limited number of subjects with breath-holding protocols rather than full-night PSG-confirmed OSA cohorts and a large model footprint that currently constrains embedded deployment.
Future work will prioritize validation on larger, clinically representative datasets, expansion to PPG-based surrogates suitable for real-world acquisition, and model compression via distillation and pruning to enable efficient embedded inference. In parallel, evaluating federated learning against the centralized baseline and refining UDF timing to better account for systematic displacement can strengthen privacy-preserving, continuous home monitoring. Additionally, we will investigate hybrid time-frequency representations, such as spectrogram-based inputs, to potentially enhance performance by combining explicit frequency-domain information with our current time-domain approach.