Article

Multimodal Fusion Multi-Task Learning Network Based on Federated Averaging for SDB Severity Diagnosis

1 Instrument Science and Electrical Engineering, Jilin University, Changchun 130012, China
2 Biomedical Diagnostics Lab, Department of Electrical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
3 Eindhoven Hendrik Casimir Institute, Electro-Optical Communication Systems, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8077; https://doi.org/10.3390/app15148077
Submission received: 18 June 2025 / Revised: 11 July 2025 / Accepted: 15 July 2025 / Published: 20 July 2025
(This article belongs to the Special Issue Machine Learning in Biomedical Applications)

Abstract

Accurate sleep staging and sleep-disordered breathing (SDB) severity prediction are critical for the early diagnosis and management of sleep disorders. However, real-world polysomnography (PSG) data often suffer from modality heterogeneity, label scarcity, and non-independent and identically distributed (non-IID) characteristics across institutions, posing significant challenges for model generalization and clinical deployment. To address these issues, we propose a federated multi-task learning (FMTL) framework that simultaneously performs sleep staging and SDB severity classification from seven multimodal physiological signals, including EEG, ECG, respiration, etc. The proposed framework is built upon a hybrid deep neural architecture that integrates convolutional layers (CNN) for spatial representation, bidirectional GRUs for temporal modeling, and multi-head self-attention for long-range dependency learning. A shared feature extractor is combined with task-specific heads to enable joint diagnosis, while the FedAvg algorithm is employed to facilitate decentralized training across multiple institutions without sharing raw data, thereby preserving privacy and addressing non-IID challenges. We evaluate the proposed method across three public datasets (APPLES, SHHS, and HMC) treated as independent clients. For sleep staging, the model achieves accuracies of 85.3% (APPLES), 87.1% (SHHS_rest), and 79.3% (HMC), with Cohen’s Kappa scores exceeding 0.71. For SDB severity classification, it obtains macro-F1 scores of 77.6%, 76.4%, and 79.1% on APPLES, SHHS_rest, and HMC, respectively. These results demonstrate that our unified FMTL framework effectively leverages multimodal PSG signals and federated training to deliver accurate and scalable sleep disorder assessment, paving the way for the development of a privacy-preserving, generalizable, and clinically applicable digital sleep monitoring system.

1. Introduction

Sleep is a fundamental physiological process crucial for maintaining physical health, cognitive abilities, emotional stability, and overall quality of life [1,2]. Sleep-disordered breathing (SDB) is an important and prevalent sleep-related disorder, characterized by recurrent episodes of apnea and hypopnea during sleep [3]. Beyond impairments in daytime functioning and quality of life, SDB is also associated with increased long-term morbidity, including elevated risks of cardiovascular and neurocognitive disorders [4,5]. In clinical sleep medicine, the gold standard for diagnosing sleep disorders is overnight polysomnography (PSG) [6], which records a wide range of physiological signals such as electroencephalogram (EEG), electrooculogram (EOG), and electromyogram (EMG), as well as airflow and/or pressure [7], thoracoabdominal movements [8], electrocardiogram (ECG), and blood oxygen saturation (SpO2). The apnea–hypopnea index (AHI), defined as the total number of apnea and hypopnea events divided by total sleep time (TST), is commonly used for the preliminary assessment of SDB severity [9]. While PSG provides comprehensive and accurate diagnostic information, it is limited by long waiting lists, complex setup, high costs, and the discomfort caused by multiple sensors and wires during measurement [10,11]. These limitations reduce the accessibility of PSG-based diagnosis and make it unsuitable for long-term SDB monitoring. To address these issues, home sleep apnea testing (HSAT) devices with simplified sensors have gained increasing attention as alternative tools for SDB diagnosis and AHI estimation [12,13]. However, HSAT still requires nontrivial setup, manual analysis, and interpretation, and often relies on total recording time rather than actual TST, which may lead to underestimation of AHI [13]. Therefore, there is a strong demand for more cost-effective, accessible, and scalable solutions for SDB monitoring and AHI assessment.
In recent years, researchers have explored the use of various single-channel or simplified multichannel physiological signals—including respiratory [14], heart rate [15], ECG [16], EEG [17], and audio [9]—for sleep staging and SDB severity assessment. For example, Xie et al. [9] proposed a multi-task learning approach using cardiac and audio signals for SDB severity classification and AHI estimation, achieving an accuracy of 57.8% for SDB severity classification on the SOMNIA dataset. Zhao et al. [18] proposed a multi-task learning model for sequence signal reconstruction and sleep staging, achieving 85.6% accuracy for sleep staging on the SHHS dataset, demonstrating the potential of multi-task learning networks. Shi et al. [19] proposed RGMNet, which achieves a sleep staging accuracy of 86.61%. Furthermore, different physiological modalities capture complementary aspects of SDB pathophysiology. For instance, ECG enables the extraction of surrogate respiratory patterns and RR intervals through heart rate variability (HRV) analysis, which can assist in SDB severity assessment, while EEG remains the primary signal for sleep staging. Therefore, integrating multiple physiological signals can enhance the performance of both SDB monitoring and sleep staging. Although these two tasks share common physiological foundations and intrinsic correlations [20], studies on unified multi-task learning frameworks capable of simultaneously addressing both tasks remain limited [21].
At the same time, increasing emphasis on data privacy and security has made it difficult to aggregate sleep data across hospitals, centers, or devices, which severely limits the generalizability of deep learning models in real-world clinical settings [22]. Moreover, real-world sleep data collected from different institutions often exhibit non-independent and identically distributed (non-IID) characteristics due to variations in population demographics, sensor configurations, and annotation standards. These distributional discrepancies pose significant challenges for collaborative training and can degrade model performance if not properly addressed [23,24]. To overcome these obstacles, federated learning (FL) [25,26] has emerged as a promising paradigm for collaborative model training across institutions, enabling improved generalization and privacy preservation without sharing raw data. Federated multi-task learning (FMTL) further allows each center to jointly train shared feature representations while retaining personalized task-specific parameters, thereby facilitating robust cross-institution SDB monitoring and sleep staging [27,28]. Although various FL methods have been proposed in the broader machine learning literature, few have focused on combining FL with multi-task learning in the specific context of sleep health. For example, Fusco et al. [29] employed a federated learning framework to analyze ECG data to detect sleep apnea, achieving an accuracy of 79.77%. Lebeña et al. [30] applied clinical federated learning to classify ICD-10 using electronic health records from multiple Spanish hospitals. However, to the best of our knowledge, the integration of multi-task learning with federated learning for comprehensive and privacy-preserving sleep monitoring remains largely unexplored. Motivated by this gap, this study aims to achieve two key objectives:
  • To develop a unified multi-task framework that leverages multimodal PSG signals, including EEG, respiration, ECG, snoring, and SpO2, for joint sleep staging and SDB severity classification, thereby enhancing diagnostic accuracy and efficiency.
  • To address the non-IID challenge inherent in real-world sleep data through federated learning, enabling decentralized model training across clinical centers while protecting patient privacy and improving cross-site generalization.
To this end, we propose a multimodal, multicenter federated multi-task learning framework that integrates a hybrid deep neural network architecture with a parameter-averaging federated optimization strategy. This approach provides a scalable, privacy-preserving solution for automated sleep disorder assessment and lays the groundwork for a fully digitalized, generalized sleep health monitoring system.

2. Methods: FMTL Framework

The overall system structure is shown in Figure 1. The FMTL framework is deployed across multiple clients, each corresponding to a dataset from a different institution for simulations. At the start of each communication round, the server broadcasts the current global model parameters to all clients. Each client then performs local training on its private data, which consists of 30 s PSG signal epochs, including EEG, EOG, EMG, ECG, respiratory, SpO2, and snoring signals. After completing its local updates, each client sends only the updated model parameters back to the central server. The server aggregates these parameters using the federated averaging (FedAvg) algorithm [31], producing a new global model that is then redistributed to every client. This loop of “broadcast → local training → parameter aggregation → broadcast” continues for multiple rounds until the global model converges.

2.1. FedAvg Algorithm

The FMTL framework extends the standard FedAvg algorithm to support multi-task learning with client-specific model heads. Specifically, we assume the global model is denoted as
$\theta = \{\theta_{\mathrm{sh}}, \theta_{\mathrm{stage}}, \theta_{\mathrm{sdb}}\}$
where $\theta_{\mathrm{sh}}$ represents the shared encoder, and $\theta_{\mathrm{stage}}$ and $\theta_{\mathrm{sdb}}$ are the task-specific heads for sleep staging and SDB severity classification, respectively.
At the beginning of each communication round $t$, the server selects a random subset of clients and broadcasts the current shared parameters $\theta_{\mathrm{sh}}^{(t)}$ to them. Each selected client $k$ performs $E$ epochs of local multi-task training using its private PSG data and returns the updated shared parameters $\theta_{\mathrm{sh},k}^{(t+1)}$. The server then aggregates these updates using a sample-weighted average:
$$\theta_{\mathrm{sh}}^{(t+1)} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, \theta_{\mathrm{sh},k}^{(t+1)}$$
where $S_t$ is the set of clients that successfully upload updates in round $t$, and $n_k$ is the number of local training samples at client $k$. Importantly, each client retains its own task-specific heads $\theta_{\mathrm{stage},k}$ and $\theta_{\mathrm{sdb},k}$, which are not aggregated and evolve independently to support personalized adaptation.
To emulate realistic deployment conditions in clinical federated learning, we modeled client-side heterogeneity during training. Each participating client was assigned a randomized training time budget $\tau_k \in [60, 180]$ seconds, reflecting variability in local computational capacity. If a client fails to complete its prescribed local training within this budget, it terminates prematurely and is excluded from the current aggregation round. In addition, we simulate network variability by sampling each client’s available upload bandwidth $b_k$ from a range of 50 to 500 KB/s; a client that exceeds a 60 s communication window when transmitting its model update to the central server is likewise excluded from that round’s aggregation. Furthermore, to capture potential device dropout or environmental interruptions, each client faces an independent stochastic dropout probability of $p_{\mathrm{drop}} = 0.1$. These mechanisms collectively introduce realistic training dynamics that mirror operational constraints in multi-institutional healthcare settings, ensuring that our federated protocol remains robust under non-ideal conditions. If fewer than $\lceil C \cdot K / 2 \rceil$ clients return updates, the server skips aggregation and retains the previous global model, i.e., $\theta_{\mathrm{sh}}^{(t+1)} = \theta_{\mathrm{sh}}^{(t)}$. To enhance privacy, each client may optionally apply gradient clipping and inject Gaussian noise into its updates to satisfy $(\epsilon, \delta)$-differential privacy.
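A minimal sketch of this heterogeneity simulation follows; the sampled local training duration and the helper's name are illustrative assumptions layered on the stated budgets ($\tau_k$, $b_k$, $p_{\mathrm{drop}}$, and the 60 s communication window).

```python
import random

def surviving_clients(client_ids, update_bytes, comm_window=60.0, p_drop=0.1):
    """Return the clients whose updates survive one communication round.
    The sampled local training duration is a hypothetical stand-in for a
    client's actual training time; the budgets follow the text above."""
    survivors = []
    for k in client_ids:
        tau_k = random.uniform(60, 180)        # training time budget (s)
        b_k = random.uniform(50, 500)          # upload bandwidth (KB/s)
        if random.random() < p_drop:
            continue                           # stochastic device dropout
        train_time = random.uniform(40, 240)   # assumed actual training time (s)
        if train_time > tau_k:
            continue                           # budget exceeded: skip this round
        if (update_bytes / 1024.0) / b_k > comm_window:
            continue                           # upload too slow for the 60 s window
        survivors.append(k)
    return survivors

# e.g., a ~5 MB shared-encoder update from clients 0..2
print(surviving_clients([0, 1, 2], update_bytes=5_000_000))
```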
We adopt a modified FedAvg training loop, outlined in Algorithm 1, where only the shared feature extractor is globally aggregated, while task-specific heads are retained locally on each client to support personalization. At the beginning of each round, a random subset of clients receives the global encoder and trains for a fixed number of local epochs. Clients that fail due to computational or network constraints are excluded from aggregation. The server aggregates updates from successful clients using sample-size-weighted averaging.
Algorithm 1. Federated multi-task training loop (FedAvg).
Input: Global shared encoder $\theta_{\mathrm{sh}}^{(0)}$, client datasets $D_k$, number of rounds $T$, local epochs $E$, batch size $B$
for $t = 1$ to $T$ do
  Server selects a random subset $S_t \subseteq \{1, \dots, K\}$ (20% of clients)
  for each client $k \in S_t$ in parallel do
    Receive $\theta_{\mathrm{sh}}^{(t)}$ from the server
    Update the local model via multi-task training for $E$ epochs:
        $\theta_{\mathrm{sh},k}^{(t+1)},\ h_{\mathrm{stage},k},\ h_{\mathrm{sdb},k} \leftarrow \mathrm{LocalTrain}(D_k, \theta_{\mathrm{sh}}^{(t)}, B)$
    Upload $\theta_{\mathrm{sh},k}^{(t+1)}$ to the server if training completed within $\tau_k$
  end for
  Server aggregates:
      $\theta_{\mathrm{sh}}^{(t+1)} \leftarrow \sum_{k \in S_t} \big( n_k / \sum_{j \in S_t} n_j \big)\, \theta_{\mathrm{sh},k}^{(t+1)}$
end for
Return: Final global encoder $\theta_{\mathrm{sh}}^{(T)}$
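A compact sketch of the server-side aggregation step in Algorithm 1, including the skip rule when too few clients report back, might look as follows (NumPy-based, with illustrative function and variable names):

```python
import numpy as np

def fedavg_aggregate(global_shared, client_updates, min_clients):
    """Sample-size-weighted FedAvg over shared-encoder weights (Algorithm 1).

    global_shared : list of np.ndarray, previous global encoder weights
    client_updates: list of (weights, n_k) pairs from clients that finished
                    within their budgets; task-specific heads never appear here.
    """
    if len(client_updates) < min_clients:
        # Too few survivors: skip aggregation, keep the previous global model.
        return global_shared
    total = float(sum(n_k for _, n_k in client_updates))
    aggregated = []
    for i in range(len(global_shared)):
        layer = sum(w[i] * (n_k / total) for w, n_k in client_updates)
        aggregated.append(layer)
    return aggregated
```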

2.2. Multi-Task Learning Framework

As illustrated in Figure 2, the FMTL model adopts a shared-bottom architecture that comprises a unified feature extractor and two task-specific prediction heads. The model takes 30 s multichannel PSG signal segments as input. The shared feature extractor first applies a series of three one-dimensional convolutional layers to capture localized time-frequency patterns, followed by batch normalization to stabilize learning. The output is then processed by a bidirectional gated recurrent unit (GRU) to model temporal dependencies across signal sequences. To further enhance context awareness, a multi-head self-attention mechanism is applied to the GRU outputs, allowing the model to learn dynamic importance weights across time steps.
Building on the shared representation, the model branches into two heads. The sleep staging head uses a Bi-LSTM network to extract sequential context, followed by dropout regularization and a dense SoftMax layer for five-class classification into the sleep stages W, N1, N2, N3, and REM. In parallel, the SDB severity classification head applies a GRU layer followed by attention pooling, a dense projection layer, and a SoftMax activation to output four-class prediction of SDB severity: normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), and severe (AHI ≥ 30), in accordance with clinical AHI thresholds. The multi-task model is optimized using a joint loss function:
$$\hat{y}_{\mathrm{stage}} = \mathrm{Softmax}\big(F_{\mathrm{stage}}(f_{\mathrm{shared}})\big), \quad \hat{y}_{\mathrm{sdb}} = \mathrm{Softmax}\big(F_{\mathrm{sdb}}(f_{\mathrm{shared}})\big)$$
$$\mathcal{L} = \lambda_1\, \mathrm{WeightedCrossEntropy}(\hat{y}_{\mathrm{stage}},\ y_{\mathrm{stage}}) + \lambda_2\, \mathrm{WeightedCrossEntropy}(\hat{y}_{\mathrm{sdb}},\ y_{\mathrm{sdb}})$$
where $\hat{y}_{\mathrm{stage}}$ and $\hat{y}_{\mathrm{sdb}}$ are the predicted outputs of the sleep staging and SDB classification heads, respectively, and $y_{\mathrm{stage}}$ and $y_{\mathrm{sdb}}$ are the corresponding ground truth labels. In our implementation, both task weights $\lambda_1$ and $\lambda_2$ are set to 0.5. The term WeightedCrossEntropy indicates that class-specific weights are applied to account for label imbalance in both tasks, ensuring that minority classes receive greater emphasis during training, following previous research [17,32].
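To make the shared-bottom design concrete, the following sketch assembles a comparable Keras model; the layer widths, kernel sizes, head count, and attention-pooling details are illustrative assumptions, since the paper specifies the component types but not every hyperparameter.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_fmtl_model(n_channels=7, fs=100, n_stages=5, n_sdb=4, d_model=64):
    """Shared-bottom multi-task model in the spirit of Figure 2.
    Layer widths, kernel sizes, and head counts are illustrative guesses."""
    inp = layers.Input(shape=(30 * fs, n_channels))  # one 30 s multichannel epoch

    # Shared feature extractor: three Conv1D blocks with batch norm,
    # a bidirectional GRU, and multi-head self-attention.
    x = inp
    for filters, kernel in [(32, 7), (64, 5), (d_model, 3)]:
        x = layers.Conv1D(filters, kernel, strides=2, padding="same",
                          activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.Bidirectional(layers.GRU(d_model, return_sequences=True))(x)
    x = layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)

    # Sleep staging head: Bi-LSTM -> dropout -> 5-way softmax (W/N1/N2/N3/REM).
    s = layers.Bidirectional(layers.LSTM(d_model))(x)
    s = layers.Dropout(0.5)(s)
    out_stage = layers.Dense(n_stages, activation="softmax", name="stage")(s)

    # SDB severity head: GRU -> attention pooling -> dense -> 4-way softmax.
    g = layers.GRU(d_model, return_sequences=True)(x)
    attn = layers.Softmax(axis=1)(layers.Dense(1, activation="tanh")(g))
    pooled = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([g, attn])
    out_sdb = layers.Dense(n_sdb, activation="softmax", name="sdb")(pooled)

    return Model(inp, [out_stage, out_sdb])

model = build_fmtl_model()
# Joint loss with equal task weights; the class-specific weights of the
# WeightedCrossEntropy terms can be supplied as sample weights in fit().
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss={"stage": "categorical_crossentropy",
                    "sdb": "categorical_crossentropy"},
              loss_weights={"stage": 0.5, "sdb": 0.5})
```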

3. Materials and Experiment Design

3.1. Dataset Description

In this study, as shown in Table 1, we adopted three PSG datasets to which access was obtained, namely the Apnea Positive Pressure Long-term Efficacy Study (APPLES), Sleep Heart Health Study (SHHS), and Haaglanden Medisch Centrum sleep staging dataset (HMC) to evaluate the effectiveness of the proposed FMTL framework. The details and the use of each dataset are provided in the following sections.

3.1.1. APPLES Dataset [33,34]

The Apnea Positive Pressure Long-term Efficacy Study (APPLES) was a multicenter, double-blind, sham-controlled trial evaluating CPAP efficacy in OSA patients. From November 2003, a total of 1516 participants were recruited and studied for up to 6 months across 11 visits. Of these, 1105 participants were randomly assigned to either an active or sham CPAP machine (REMstar Pro, Philips Respironics, Murrysville, PA, USA); the sham CPAP machine produced airflow and noise at the expiratory port very similar to those of the active machine. A total of 1098 participants diagnosed with obstructive sleep apnea (OSA) were included in the analysis of the primary outcome measures. The APPLES study was completed in August 2008. Only subjects in the active CPAP group with complete PSG data were included in this study (382 participants in total); the screening process is shown in Figure 3.

3.1.2. SHHS Dataset [33,35]

The SHHS is a multicenter cohort study focusing on the cardiovascular consequences of sleep-disordered breathing. Subjects enrolled in the study have a variety of conditions, ranging from pulmonary and cardiovascular diseases to coronary complications. A total of 6441 men and women aged 40 years and older were recruited to the first SHHS visit between 1 November 1995 and 31 January 1998. Due to data sharing rules for certain groups and subjects, this study used the SHHS1 dataset, which contains 5463 subjects.

3.1.3. HMC Dataset [36,37,38]

The HMC in The Hague, the Netherlands, collected a diverse dataset of 151 whole-night PSG sleep recordings in 2018. This collection includes data from 85 male and 66 female participants aged 53.9 ± 15.4 years (mean ± standard deviation). All signals were sampled at 256 Hz. Signals were recorded using SOMNOscreen PSG, PSG+, and EEG 10–20 recorders (SOMNOmedics, Randersacker, Germany) with Ag/AgCl electrodes. Sleep stages were scored manually by well-trained sleep technicians according to version 2.4 of the AASM guidelines.
For the sleep staging task, we followed standard preprocessing steps as described in our prior work [17], including segmentation of PSG recordings into 30 s epochs and label alignment based on expert annotations following AASM criteria. Sleep stages were categorized into five classes using expert manual annotations for each dataset: W, N1, N2, N3, and REM. Signals from EEG, ECG, and respiratory channels were normalized to zero mean and unit variance within each subject to reduce inter-patient variability.
For SDB severity classification, we adopted a proxy label extraction strategy based on ECG-derived respiratory analysis. Specifically, we implemented an ECG-based AHI approximation pipeline inspired by prior cardiopulmonary coupling studies [39]. The core idea is to infer SDB events from surrogate features extracted from R-R interval (RRI) dynamics, which reflect autonomic nervous system fluctuations associated with apneic episodes. First, we extracted the RRI time series from the ECG signals using a peak detection algorithm (e.g., Pan–Tompkins), followed by interpolation at 4 Hz to generate evenly sampled tachograms. Each night-long ECG signal was then segmented into overlapping 5 min windows with 30 s strides. Within each window, we computed short-term HRV features that are physiologically sensitive to respiratory effort and arousals, including the standard deviation of NN intervals (SDNN), the root mean square of successive differences (RMSSD), the low- to high-frequency power ratio (LF/HF), and the respiratory-modulated cardiac oscillation index (RMCOI), a proxy for cyclic variation in heart rate (CVHR). To identify putative apnea–hypopnea events, we applied an unsupervised peak detection approach to the CVHR signal within each 5 min segment. Peaks meeting duration (≥10 s) and amplitude criteria (adaptive thresholding) were marked as surrogate events. The local AHI was then computed as the number of surrogate events per hour, scaled appropriately to the 5 min window. Finally, each 30 s epoch was assigned a severity label based on the maximum AHI observed in its surrounding window, with the ECG-derived AHI values mapped into four discrete levels in line with clinical thresholds: normal (AHI < 5), mild (5 ≤ AHI < 15), moderate (15 ≤ AHI < 30), and severe (AHI ≥ 30). This process yields temporally aligned, discrete SDB labels that can be used directly for multi-task learning.
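As a minimal illustration of this proxy-labeling pipeline, the sketch below builds a 4 Hz tachogram from R-peak times and counts CVHR-like oscillation peaks per 5 min window. The peak-detection thresholds, the use of scipy.signal.find_peaks in place of a full Pan–Tompkins implementation, and the function names are our assumptions, not the paper's released code.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.interpolate import interp1d

def severity_from_ahi(ahi):
    """Map an AHI value to the four clinical severity levels."""
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"

def surrogate_ahi(rpeak_times, win=300.0, stride=30.0, fs_tach=4.0):
    """Sketch of the ECG-based AHI proxy: interpolate R-R intervals to a
    4 Hz tachogram, then count CVHR-like peaks per 5 min window.
    Thresholds below are illustrative placeholders, not tuned values."""
    rri = np.diff(rpeak_times)                   # R-R intervals (s)
    t = rpeak_times[1:]
    grid = np.arange(t[0], t[-1], 1.0 / fs_tach)
    tachogram = interp1d(t, rri, kind="cubic", fill_value="extrapolate")(grid)

    labels = []
    for start in np.arange(grid[0], grid[-1] - win, stride):
        seg = tachogram[(grid >= start) & (grid < start + win)]
        # CVHR proxy: slow oscillations of the tachogram around its median.
        osc = seg - np.median(seg)
        # duration >= 10 s -> peaks at least 10 * fs_tach samples wide,
        # with an adaptive prominence threshold (assumed criteria).
        peaks, _ = find_peaks(osc, width=10 * fs_tach,
                              prominence=1.5 * np.std(osc))
        ahi = len(peaks) * (3600.0 / win)        # surrogate events per hour
        labels.append((start, ahi, severity_from_ahi(ahi)))
    return labels
```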
In this study, we selected seven physiological channels from each dataset to construct a consistent multimodal input space for sleep staging and SDB severity classification. For the EEG signal, we chose one central derivation per dataset based on availability and reliability: C3-M2 from APPLES, C4-M1 from SHHS, and F4-M1 from HMC. For EOG, which assists in REM detection through eye movement patterns, we selected ROC-M1 for APPLES, the left EOG channel for SHHS, and LOC-M2 for HMC. The EMG channel, used to track muscle tone fluctuations, was extracted from chin recordings: Chin EMG in APPLES, standard EMG in SHHS, and Chin1-Chin2 in HMC. For ECG, which provides heart rate variability cues linked to SDB events, we used a single ECG lead from each dataset. In terms of respiratory signals, we selected the most representative airflow-related channel for apnea detection: Nasal Pressure for APPLES, Airflow for SHHS, and Nasal Thermistor for HMC. To assess desaturation events, we included SpO2 signals from all three datasets. Finally, snore-related audio signals were incorporated via the Neck Microphone in APPLES and the standard Snore channels in SHHS and HMC. These selected modalities were harmonized across datasets to maximize physiological diversity and support robust multimodal representation learning.
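For reference, the channel harmonization described above can be captured in a simple mapping table; a sketch follows, where the exact channel-name strings (which vary with each dataset's EDF headers) are illustrative assumptions transcribed from the text.

```python
# Channel-name harmonization across the three clients, transcribed from the
# text above; the exact strings in each dataset's EDF headers may differ.
CHANNEL_MAP = {
    "APPLES":    {"EEG": "C3-M2", "EOG": "ROC-M1", "EMG": "Chin EMG",
                  "ECG": "ECG", "RESP": "Nasal Pressure",
                  "SPO2": "SpO2", "SNORE": "Neck Microphone"},
    "SHHS_rest": {"EEG": "C4-M1", "EOG": "EOG(L)", "EMG": "EMG",
                  "ECG": "ECG", "RESP": "Airflow",
                  "SPO2": "SpO2", "SNORE": "Snore"},
    "HMC":       {"EEG": "F4-M1", "EOG": "LOC-M2", "EMG": "Chin1-Chin2",
                  "ECG": "ECG", "RESP": "Nasal Thermistor",
                  "SPO2": "SpO2", "SNORE": "Snore"},
}
```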
To simulate a realistic federated clinical collaboration scenario, we adopted a server-to-client pretraining and federated fine-tuning framework across multiple datasets. Specifically, following our previous work [17], we split off a subset of the SHHS dataset containing 329 patients and used it as a centralized proxy dataset for server-side pretraining. A multi-task model was initialized and trained on these data to capture general representations across sleep stages and SDB severity levels. The resulting global model parameters, which include a shared feature extractor and two task-specific heads, were then distributed to all participating clients: APPLES, the remaining SHHS data (denoted as SHHS_rest), and HMC. The specific details of each dataset are shown in Table 1.

3.2. Experiment Design

The training task is divided into two stages: centralized pretraining and federated fine-tuning. In the pretraining stage, we used a subset of 329 subjects from the SHHS dataset (denoted as SHHS_pretrain) to initialize the global model in a centralized manner. This model includes a shared feature extractor and two task-specific heads (for sleep staging and SDB severity classification). The resulting parameters serve as the starting point for the subsequent federated training phase.
In the federated setting, we simulated three institutional clients, each corresponding to one dataset: APPLES, HMC, and the remaining SHHS (denoted as SHHS_rest). For all clients (i.e., APPLES, HMC, SHHS_rest), the available subjects are randomly partitioned at the patient level into non-overlapping training, validation, and test subsets in an 8:1:1 ratio. Federated learning is conducted over T = 100 communication rounds, and in each round, a random subset of C = 0.2 × K clients (with K = 3 ) is selected to participate. Each selected client performs E = 5 epochs of local multi-task training using a batch size of B = 32 on its private PSG data. Model updates are computed and returned to the server, which aggregates only the shared encoder parameters using the weighted averaging strategy described in Section 2.1. Task-specific heads are retained locally on each client to allow for personalized adaptation. The validation set is used solely for early stopping and hyperparameter selection, with no access to test data at any stage of training. Early stopping is based on the validation loss of the sleep staging task, with a patience of 10 communication rounds. Model checkpoints yielding the lowest validation loss are retained for final evaluation. Evaluation results are reported separately under two settings. In the local evaluation setting, each client reports test performance using its own local model on its respective test set. In the cross-client evaluation setting, a model trained on one client is directly applied to the test set of a different client without further adaptation.
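A minimal sketch of the patient-level 8:1:1 partition described above is given below; the function name and seed handling are illustrative assumptions.

```python
import random

def patient_level_split(subject_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Partition subjects (not epochs) into non-overlapping train/val/test
    sets at 8:1:1, so no patient's recordings leak across subsets."""
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return (ids[:n_train],                  # training subjects
            ids[n_train:n_train + n_val],   # validation subjects
            ids[n_train + n_val:])          # test subjects
```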
To simulate realistic deployment scenarios, we introduced heterogeneity in computational and communication environments across clients. Each client is assigned a random training time budget $\tau_k \in [60, 180]$ seconds and an upload bandwidth $b_k \in [50, 500]$ KB/s. Furthermore, an independent stochastic dropout probability of $p_{\mathrm{drop}} = 0.1$ is applied. These settings are consistent with the federated simulation protocol introduced in Section 2.1 and are designed to emulate real-world variability in institutional resources.
To enable parameter optimization, training, and evaluation of models, we chose Python 3.10 as the programming language and TensorFlow 2.13.0 as the deep learning framework. Three NVIDIA RTX A5000 GPUs were used to perform computations. We used the “ReduceLROnPlateau” method [40] to adjust the learning rate during training. The key training hyperparameters used in all experiments are summarized in Table 2. Each experiment was repeated five times with different random seeds, and results are reported as the average.
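For illustration, the learning-rate schedule and early-stopping behavior described here map onto standard Keras callbacks as sketched below; the monitored metric name, reduction factor, and minimum learning rate are plausible assumptions rather than the paper's exact configuration.

```python
import tensorflow as tf

# Learning-rate reduction and early stopping as described above. In the
# federated setting, early stopping tracks the staging validation loss
# across communication rounds (patience = 10) rather than local epochs.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_stage_loss",
                                         factor=0.5, patience=3, min_lr=1e-5),
    tf.keras.callbacks.EarlyStopping(monitor="val_stage_loss",
                                     patience=10, restore_best_weights=True),
]
```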
The training hyperparameters listed in Table 2 were selected through expert tuning and empirical validation to ensure stable convergence and effective optimization across tasks and clients. Specifically, the initial learning rate of 1 × 10−3 was chosen as a commonly effective starting point for the Adam optimizer and was further adjusted dynamically during training using the ReduceLROnPlateau strategy to avoid overfitting. The task loss weights ($\lambda_1 = \lambda_2 = 0.5$) reflect equal emphasis on the sleep staging and SDB classification tasks in the joint loss function, as both are clinically important and complementary. The choice of 5 local epochs per communication round ensures adequate local updates while preventing client drift, a common issue in non-IID federated settings. For the federated parameters, the number of communication rounds and the participation rate follow established practices in the FL literature [41,42,43], where partial client participation promotes robustness under heterogeneous environments. The simulation of heterogeneous training time limits ($\tau_k$) and bandwidth constraints ($b_k$), along with a client dropout probability, is designed to reflect practical deployment variability and enhance the realism of the evaluation setting. These hyperparameters were further validated via grid search on the validation sets to ensure consistent performance across datasets.

3.3. Evaluation Metrics

To comprehensively evaluate the effectiveness of the proposed FMTL framework, we employed a suite of standard classification metrics tailored to the characteristics of each task. For the sleep staging task, we assessed performance using overall accuracy, Cohen’s Kappa coefficient [40], macro-averaged F1-score (MF1), and macro geometric mean (MGM), together with class-specific recall and specificity. For the SDB severity classification task, we used overall accuracy, macro-F1 score (MF1), and macro-area under the receiver operating characteristic curve (mAUC) to capture both overall and class-level discriminative ability. In addition, per-class recall, precision, sensitivity, and specificity are reported for each severity category to provide insight into the model’s clinical utility.
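The metrics above can be computed with standard scikit-learn utilities; a sketch follows, in which the MGM is taken as the macro average of the per-class geometric mean of sensitivity and specificity (our assumption of its definition).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             recall_score, roc_auc_score)

def staging_metrics(y_true, y_pred):
    """Sleep-staging metrics; MGM is taken as the macro average of the
    per-class geometric mean of sensitivity and specificity (assumed)."""
    sens = recall_score(y_true, y_pred, average=None)
    specs = []
    for c in np.unique(y_true):
        tn = np.sum((y_true != c) & (y_pred != c))  # true negatives for class c
        fp = np.sum((y_true != c) & (y_pred == c))  # false positives for class c
        specs.append(tn / (tn + fp))
    return {"acc": accuracy_score(y_true, y_pred),
            "kappa": cohen_kappa_score(y_true, y_pred),
            "mf1": f1_score(y_true, y_pred, average="macro"),
            "mgm": float(np.mean(np.sqrt(sens * np.array(specs))))}

def sdb_metrics(y_true, y_prob):
    """SDB metrics: accuracy, macro-F1, and macro one-vs-rest AUC."""
    y_pred = np.argmax(y_prob, axis=1)
    return {"acc": accuracy_score(y_true, y_pred),
            "mf1": f1_score(y_true, y_pred, average="macro"),
            "mauc": roc_auc_score(y_true, y_prob,
                                  multi_class="ovr", average="macro")}
```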
To assess the generalization capabilities of the proposed framework, we evaluate the model under two distinct settings: local evaluation, in which each client is tested on its own held-out dataset, and cross-client evaluation, where a model trained on one institution is applied to data from another. This dual evaluation strategy enables us to quantify both personalized performance and cross-domain transferability under non-IID conditions. All metrics are averaged across five independent runs with random initialization to ensure statistical robustness.

4. Results

4.1. Sleep Staging Performance

The proposed FMTL model demonstrates consistent and strong performance in the sleep staging task across the APPLES, SHHS_rest, and HMC datasets. As shown in Table 3 and the confusion matrices in Figure 4, the model achieves an overall accuracy of 85.3% on APPLES, 87.1% on SHHS_rest, and 79.3% on HMC. In terms of Cohen’s Kappa coefficient (κ), which reflects agreement with expert annotations, the results are 0.71 (APPLES), 0.86 (SHHS_rest), and 0.72 (HMC), indicating substantial reliability despite domain shifts between datasets. The MF1 score also shows competitive generalization, reaching 76.7% on APPLES, 81.5% on SHHS_rest, and 77.0% on HMC. Specifically, per-class F1 scores demonstrate that stages N2 and Wake are most accurately identified across all datasets (e.g., N2-F1: 93.5% for APPLES, 89.5% for SHHS_rest, 82.4% for HMC), while N1 remains the most challenging stage (N1-F1: 53.7%, 57.8%, and 61.4%, respectively). REM and N3 also exhibit relatively strong performance across the three cohorts.

4.2. SDB Severity Classification Performance

The FMTL framework demonstrated strong generalization ability in the SDB severity classification task across all three datasets. As shown in Table 4 and Figure 5, the model achieved per-class AUCs of 85.2%, 71.1%, 90.3%, and 82.8% on the APPLES dataset for the normal, mild, moderate, and severe categories, respectively. The overall accuracy, MF1 score, and MAUC reached 79.2%, 74.6%, and 82.4%, respectively, with a macro recall of 73.4% and macro specificity of 91.3%. On the SHHS_rest dataset, the model further improved, achieving per-class AUCs of 86.7%, 82.2%, 90.6%, and 94.9% for the four categories. The overall accuracy, MF1, and MAUC were 84.5%, 83.8%, and 88.6%, respectively, with macro recall and specificity reaching 82.7% and 94.5%. Similarly, the HMC dataset results confirm the robust generalization of the framework, with per-class AUCs of 92.1%, 82.9%, 89.5%, and 95.4%, and overall accuracy, MF1, and MAUC of 85.4%, 85.0%, and 89.9%, respectively. Both macro recall (84.8%) and macro specificity (95.0%) indicate high reliability in distinguishing all four SDB severity levels. The confusion matrices in Figure 5 further illustrate that the model maintains good balance in classification across different categories and datasets, though slight variation in performance can be attributed to differences in class distribution and signal quality among the cohorts.

4.3. FMTL Framework Performance

To evaluate the effectiveness of our proposed FMTL framework, we conducted comparative experiments from two perspectives. First, as illustrated in Figure 6, we examined the impact of the federated learning strategy by comparing three settings:
  • Single: models are trained and tested independently on each client’s local data.
  • Mix: data from all clients are pooled in a centralized setting to train a shared global model.
  • FedAvg: the standard federated averaging algorithm is used to simulate distributed collaborative learning without data sharing.
As summarized in Table 5, under simulated multi-client conditions, the FMTL framework achieved the highest accuracy for both sleep staging and SDB severity classification, demonstrating its ability to provide reliable diagnostic performance without compromising data privacy. Specifically, the APPLES client shows an increase in accuracy from 78.2% (Single) and 83.5% (Mix) to 85.3% (FMTL), and an MF1 improvement from 69.1% (Single) and 72.2% (Mix) to 76.7% (FMTL). Similar trends are observed for the SHHS and HMC clients, highlighting the superior generalization ability of FMTL even in the presence of data heterogeneity. For OSA severity classification, the FMTL model also achieves the best results, with the highest accuracy and MAUC on all clients. For instance, on the HMC dataset, the accuracy rises from 82.5% (Single) and 83.1% (Mix) to 85.4% (FMTL), while the MAUC improves from 85.1% (Single) and 86.7% (Mix) to 89.9% (FMTL). These gains confirm that the federated multi-task learning framework can effectively leverage distributed data to boost model robustness and classification power while preserving local data privacy.

5. Discussion

The proposed FMTL framework consistently performed well in both sleep staging and SDB severity classification tasks across all three datasets, validating its capacity to generalize under real-world non-IID data distributions. The use of multi-task learning significantly contributed to model performance. Compared with single-task baselines, joint optimization of sleep staging and SDB classification allowed the model to benefit from shared information between the two tasks, which are physiologically correlated.
In this study, a deep neural network was constructed based on seven-channel multimodal signals (EEG, EOG, EMG, ECG, respiratory airflow, blood oxygen saturation, and snoring) to achieve joint recognition of sleep stages and SDB severity. Compared with traditional single-channel or unimodal models, multimodal input not only improved the expressive power of the feature space but also prompted the model to learn shared and specific representations across modalities during encoding. In our shared CNN-GRU backbone, early fusion guided the network to establish a unified time-frequency representation at a low level, while task-specific branches further explored task-related modalities. High accuracy and Kappa coefficients were maintained across datasets (SHHS_rest and HMC), indicating that multimodal fusion not only improves task discrimination but also improves adaptability to individual differences, device differences, and labeling styles to a certain extent. For difficult-to-distinguish states such as N1, REM, and mild-to-moderate SDB, the multimodal model performed significantly better than the unimodal baseline, indicating that it indeed captured compensatory cues across modalities.
We particularly observed that the classification performance for the N1 stage was substantially improved compared with traditional methods. The EEG activity of the N1 stage usually lies between wakefulness and N2, lacking clear K-complexes or spindle events, which makes it easily confused with wakefulness or N2 during visual annotation [44]. The literature shows that inter-rater agreement for N1 annotation among clinical sleep experts is only about 40–60%, significantly lower than for other stages. The model based on the federated multi-task learning framework in this study achieved considerable F1 scores in N1 recognition (APPLES: 53.7%, SHHS_rest: 57.8%, HMC: 61.4%), a marked improvement over previous single-task or single-modality methods. We attribute this to information transfer between tasks. In jointly optimizing the sleep staging and SDB classification tasks, the multi-task learning model can learn more abstract and discriminative shared features. During training, the model is forced to be sensitive to subtle physiological changes associated with respiratory disorders, which often occur in the N1 stage at sleep onset. For example, when breathing is mildly restricted, individuals are prone to micro-arousals and frequent regressions to N1 [45], implying a statistical association between N1 and SDB labels. This cross-task structural information provides additional indirect supervision for the N1 stage and enhances the model’s ability to delineate its discriminative boundaries. In addition, multimodal fusion likely helps with boundary modeling. Compared with traditional methods that rely only on EEG signals, we integrate multiple modalities such as EOG, EMG, ECG, and respiratory airflow. EOG captures micro-eye-movement signals that help identify the slow rolling eye movements of N1; EMG reflects subtle changes in muscle tone, helping to distinguish low-tension N1 from the fully awake state; and ECG and respiratory signals also exhibit characteristic autonomic nervous system adjustments during N1. These modalities jointly participate in temporal modeling in the shared encoder, enabling the model to recognize the “dynamic transitional” character of N1 rather than relying solely on vague static patterns from a single modality.
Table 6 provides a summary of several representative studies, including details on the authors, network architecture, input modalities, output tasks, datasets, and reported results. These works span a range of methodologies, from deep learning models such as CNNs and RNNs to classical machine learning approaches like random forests and gradient boosting. However, due to variations in dataset composition, input modalities, and experimental protocols, it remains difficult to make direct and fully fair performance comparisons across studies. For instance, differences in signal preprocessing, segment lengths, subject populations, and evaluation metrics (e.g., cross-validation vs. independent test sets) all contribute to discrepancies in reported outcomes. Nonetheless, despite these inherent challenges in cross-study comparability, our proposed FMTL framework achieves consistently competitive or superior results across multiple public datasets, demonstrating its robustness and adaptability.
From a deployment standpoint, the proposed framework is designed to operate under practical computational constraints. During federated training, each client only performs lightweight local updates with five epochs and a batch size of 32, which can be executed efficiently on standard institutional servers equipped with mid-range GPUs or even modern CPUs. The model architecture, composed of compact CNN, GRU, and attention layers, was deliberately chosen for its balance between expressiveness and computational efficiency. For low-resource settings, such as small clinics or edge devices, the inference phase requires minimal resources and can be executed in near real time. Local fine-tuning or participation in training can be scheduled during idle periods to minimize system load. These design choices ensure that the framework is feasible for deployment across diverse healthcare environments, including those with limited computational capacity.
However, several limitations remain in the current study. First, although the model was evaluated on three public datasets, the overall sample size is still limited compared to real-world clinical populations. Second, the model relies mainly on EEG, ECG, and basic respiratory signals; although SpO2 and EMG are included, they contribute relatively little to the model owing to data quality issues. Improving data quality or incorporating additional modalities such as audio and snoring may further improve classification performance. Third, the class imbalance in certain severity levels (e.g., mild vs. moderate SDB) may introduce performance bias, requiring further exploration of adaptive strategies such as dynamic loss weighting [17] or advanced loss functions like focal loss [52], which can better handle imbalanced distributions during training.
Looking ahead, several avenues can further enhance the proposed FMTL framework. While FedAvg proved effective in our setup, alternative federated optimization strategies such as FedProx [53], FedDyn [54], and personalization-aware methods should be investigated to better accommodate client heterogeneity and communication constraints. Importantly, the privacy-preserving design of our FMTL framework inherently aligns with regulatory requirements such as HIPAA and GDPR, as it ensures that sensitive health data remain on local servers and only model updates are shared. To further strengthen privacy guarantees, future iterations could incorporate differential privacy techniques during local training or apply secure aggregation protocols at the server side, offering stronger protection against inference attacks and reinforcing clinical trust. Beyond the current dual-task setting, the modular structure of the framework facilitates extension to additional applications, including insomnia detection, longitudinal sleep health tracking, and multimodal anesthesia monitoring, supporting more comprehensive and fine-grained assessment of sleep-related conditions. Finally, integrating advanced model architectures such as transformer-based encoders or contrastive representation learning may further improve accuracy and interpretability. We also envision future deployment of this framework on wearable or contactless devices, enabling decentralized, real-time, and privacy-preserving sleep monitoring in both clinical and home settings.

6. Conclusions

In this paper, we proposed a federated multi-task learning (FMTL) framework that jointly performs sleep staging and sleep-disordered breathing (SDB) severity classification. By integrating a shared CNN–GRU–attention encoder with task-specific output heads, the framework learns comprehensive representations from multichannel PSG signals. It leverages federated learning to enable cross-institutional collaboration without compromising data privacy. Extensive experiments were conducted on three public PSG datasets treated as simulated clients (APPLES, SHHS, and HMC). For sleep staging, the model achieved accuracies of 85.3% (APPLES), 87.1% (SHHS_rest), and 79.3% (HMC), with Cohen’s Kappa scores all exceeding 0.71, reflecting high agreement with expert annotations. For SDB severity classification, it attained macro-F1 scores of 77.6%, 76.4%, and 79.1% on the same datasets, validating its capability to effectively differentiate between normal, mild, moderate, and severe SDB cases. These results demonstrate the model’s effectiveness and generalizability under non-IID conditions, with consistently high performance in both tasks across diverse populations and clinical settings.

Author Contributions

Conceptualization, S.L. and R.T.; data curation, S.L. and R.T.; formal analysis, Y.W.; methodology, S.L. and R.T.; supervision, Y.W. and Z.W.; writing—original draft, S.L.; writing—review and editing, R.T., Y.W. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study uses three publicly available datasets: APPLES, SHHS, and the HMC dataset. The APPLES and SHHS datasets are hosted by the National Sleep Research Resource (NSRR). Access requires formal application and approval through the NSRR platform. Detailed information and access procedures can be found at the following links: APPLES: https://sleepdata.org/datasets/apples, accessed on 27 September 2024; SHHS: https://sleepdata.org/datasets/shhs, accessed on 15 June 2023. The HMC Sleep Staging dataset is available via PhysioNet for open access. Further details and access instructions are provided at https://physionet.org/content/hmc-sleep-staging/1.1/.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alimova, M.; Djyanbekova, F.; Meliboyeva, D.; Safarova, D. Benefits of sleep. Mod. Sci. Res. 2025, 4, 780–789. [Google Scholar]
  2. Moosavi-Movahedi, A.A.; Moosavi-Movahedi, F.; Yousefi, R. Good sleep as an important pillar for a healthy life. In Rationality and Scientific Lifestyle for Health; Springer: Cham, Switzerland, 2021; pp. 167–195. [Google Scholar]
  3. Ioachimescu, O.C.; Collop, N.A. Sleep-disordered breathing. Neurol. Clin. 2012, 30, 1095–1136. [Google Scholar] [CrossRef] [PubMed]
  4. Quan, S.F.; Gersh, B.J. Cardiovascular consequences of sleep-disordered breathing: Past, present and future: Report of a workshop from the National Center on Sleep Disorders Research and the National Heart, Lung, and Blood Institute. Circulation 2004, 109, 951–957. [Google Scholar] [CrossRef] [PubMed]
  5. Leng, Y.; McEvoy, C.T.; Allen, I.E.; Yaffe, K. Association of sleep-disordered breathing with cognitive function and risk of cognitive impairment: A systematic review and meta-analysis. JAMA Neurol. 2017, 74, 1237–1245. [Google Scholar] [CrossRef] [PubMed]
  6. Rundo, J.V.; Downey, R., III. Polysomnography. Handb. Clin. Neurol. 2019, 160, 381–392. [Google Scholar] [PubMed]
  7. Gonzalez-Bermejo, J.; Perrin, C.; Janssens, J.; Pepin, J.; Mroue, G.; Léger, P.; Langevin, B.; Rouault, S.; Rabec, C.; Rodenstein, D. Proposal for a systematic analysis of polygraphy or polysomnography for identifying and scoring abnormal events occurring during non-invasive ventilation. Thorax 2012, 67, 546–552. [Google Scholar] [CrossRef] [PubMed]
  8. Lin, Y.-Y.; Wu, H.-T.; Hsu, C.-A.; Huang, P.-C.; Huang, Y.-H.; Lo, Y.-L. Sleep apnea detection based on thoracic and abdominal movement signals of wearable piezoelectric bands. IEEE J. Biomed. Health Inform. 2016, 21, 1533–1545. [Google Scholar] [CrossRef] [PubMed]
  9. Xie, J.; Fonseca, P.; van Dijk, J.; Overeem, S.; Long, X. A multi-task learning model using RR intervals and respiratory effort to assess sleep disordered breathing. Biomed. Eng. OnLine 2024, 23, 45. [Google Scholar] [CrossRef] [PubMed]
  10. Campbell, A.J.; Neill, A.M. Home set-up polysomnography in the assessment of suspected obstructive sleep apnea. J. Sleep Res. 2011, 20, 207–213. [Google Scholar] [CrossRef] [PubMed]
  11. Portier, F.; Portmann, A.; Czernichow, P.; Vascaut, L.; Devin, E.; Benhamou, D.; Cuvelier, A.; Muir, J.F. Evaluation of home versus laboratory polysomnography in the diagnosis of sleep apnea syndrome. Am. J. Respir. Crit. Care Med. 2000, 162, 814–818. [Google Scholar] [CrossRef] [PubMed]
  12. Zancanella, E.; do Prado, L.F.; de Carvalho, L.B.; Machado Júnior, A.J.; Crespo, A.N.; do Prado, G.F. Home sleep apnea testing: An accuracy study. Sleep Breath. 2022, 26, 117–123. [Google Scholar] [CrossRef] [PubMed]
  13. Golpe, R.; Jime, A.; Carpizo, R. Home sleep studies in the assessment of sleep apnea/hypopnea syndrome. Chest 2002, 122, 1156–1161. [Google Scholar] [CrossRef] [PubMed]
  14. Gutiérrez-Tobal, G.C.; Hornero, R.; Álvarez, D.; Marcos, J.V.; Del Campo, F. Linear and nonlinear analysis of airflow recordings to help in sleep apnoea–hypopnoea syndrome diagnosis. Physiol. Meas. 2012, 33, 1261. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, L.; Lin, Y.; Wang, J. A RR interval based automated apnea detection approach using residual network. Comput. Methods Programs Biomed. 2019, 176, 93–104. [Google Scholar] [CrossRef] [PubMed]
  16. Lin, S.; Wang, Y.; Wang, Z. An Automatic Multi-Head Self-Attention Sleep Staging Method Using Single-Lead Electrocardiogram Signals. Comput. Cardiol. 2024, 51, 1–4. [Google Scholar]
  17. Lin, S.; Wang, Z.; van Gorp, H.; Xu, M.; van Gilst, M.; Overeem, S.; Linnartz, J.-P.; Fonseca, P.; Long, X. SSC-SleepNet: A Siamese-Based Automatic Sleep Staging Model with Improved N1 Sleep Detection. IEEE J. Biomed. Health Inform. 2025, 1–13. [Google Scholar] [CrossRef] [PubMed]
  18. Zhao, C.; Li, J.; Guo, Y. Sequence signal reconstruction based multi-task deep learning for sleep staging on single-channel EEG. Biomed. Signal Process. Control 2024, 88, 105615. [Google Scholar] [CrossRef]
  19. Shi, L.; Gui, R.; Wang, L.; Li, P.; Niu, Q. A Multi-Task Deep Learning Approach for Simultaneous Sleep Staging and Apnea Detection for Elderly People. Interdiscip. Sci. Comput. Life Sci. 2025, 1–16. [Google Scholar] [CrossRef] [PubMed]
  20. Cacioppo, J.T.; Tassinary, L.G. Inferring psychological significance from physiological signals. Am. Psychol. 1990, 45, 16. [Google Scholar] [CrossRef] [PubMed]
  21. Zhang, X.; Guo, J.; Zhao, S.; Fu, M.; Duan, L.; Wang, G.-H.; Chen, Q.-G.; Xu, Z.; Luo, W.; Zhang, K. Unified multimodal understanding and generation models: Advances, challenges, and opportunities. arXiv 2025, arXiv:250502567. [Google Scholar]
  22. Aggarwal, N.; Ahmed, M.; Basu, S.; Curtin, J.J.; Evans, B.J.; Matheny, M.E.; Nundy, S.; Sendak, M.P.; Shachar, C.; Shah, R.U. Advancing artificial intelligence in health settings outside the hospital and clinic. NAM Perspect. 2020, 2020. [Google Scholar] [CrossRef] [PubMed]
  23. Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the convergence of fedavg on non-iid data. arXiv 2019, arXiv:190702189. [Google Scholar]
  24. Ma, X.; Zhu, J.; Lin, Z.; Chen, S.; Qin, Y. A state-of-the-art survey on solving non-iid data in federated learning. Future Gener. Comput. Syst. 2022, 135, 244–258. [Google Scholar] [CrossRef]
  25. Mammen, P.M. Federated learning: Opportunities and challenges. arXiv 2021, arXiv:210105428. [Google Scholar]
  26. Li, L.; Fan, Y.; Tse, M.; Lin, K.-Y. A review of applications in federated learning. Comput. Ind. Eng. 2020, 149, 106854. [Google Scholar] [CrossRef]
  27. Smith, V.; Chiang, C.-K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  28. Marfoq, O.; Neglia, G.; Bellet, A.; Kameni, L.; Vidal, R. Federated multi-task learning under a mixture of distributions. Adv. Neural Inf. Process. Syst. 2021, 34, 15434–15447. [Google Scholar]
  29. Fusco, P.; Errico, P.; Venticinque, S. Federated Learning Algorithm for Identification of Apnea Sleeping Disorder. In International Conference on Advanced Information Networking and Applications; Springer: Berlin/Heidelberg, Germany, 2025; pp. 253–261. [Google Scholar]
  30. Lebeña, N.; Blanco, A.; Casillas, A.; Oronoz, M.; Pérez, A. Clinical Federated Learning for Private ICD-10 Classification of Electronic Health Records from Several Spanish Hospitals. Proces. Leng. Nat. 2025, 74, 33–42. [Google Scholar]
  31. Sun, T.; Li, D.; Wang, B. Decentralized federated averaging. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4289–4301. [Google Scholar] [CrossRef] [PubMed]
  32. Eldele, E.; Chen, Z.; Liu, C.; Wu, M.; Kwoh, C.-K.; Li, X.; Guan, C. An attention-based deep learning approach for sleep stage classification with single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 809–818. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, G.-Q.; Cui, L.; Mueller, R.; Tao, S.; Kim, M.; Rueschman, M.; Mariani, S.; Mobley, D.; Redline, S. The National Sleep Research Resource: Towards a sleep data commons. J. Am. Med. Inform. Assoc. 2018, 25, 1351–1358. [Google Scholar] [CrossRef] [PubMed]
  34. Quan, S.F.; Chan, C.S.; Dement, W.C.; Gevins, A.; Goodwin, J.L.; Gottlieb, D.J.; Green, S.; Guilleminault, C.; Hirshkowitz, M.; Hyde, P.R. The association between obstructive sleep apnea and neurocognitive performance—The Apnea Positive Pressure Long-term Efficacy Study (APPLES). Sleep 2011, 34, 303–314. [Google Scholar] [CrossRef] [PubMed]
  35. Quan, S.F.; Howard, B.V.; Iber, C.; Kiley, J.P.; Nieto, F.J.; O’Connor, G.T.; Rapoport, D.M.; Redline, S.; Robbins, J.; Samet, J.M. The sleep heart health study: Design, rationale, and methods. Sleep 1997, 20, 1077–1085. [Google Scholar] [CrossRef] [PubMed]
  36. Alvarez-Estevez, D.; Rijsman, R. Haaglanden medisch centrum sleep staging database (version 1.1). PhysioNet 2022. [Google Scholar] [CrossRef]
  37. Alvarez-Estevez, D.; Rijsman, R.M. Inter-database validation of a deep learning approach for automatic sleep scoring. PLoS ONE 2021, 16, e0256111. [Google Scholar] [CrossRef] [PubMed]
  38. Goldberger, A.L.; Amaral, L.A.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.-K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220. [Google Scholar] [CrossRef] [PubMed]
  39. Xie, J.; Fonseca, P.; van Dijk, J.P.; Long, X.; Overeem, S. The use of respiratory effort improves an ECG-based deep learning algorithm to assess sleep-disordered breathing. Diagnostics 2023, 13, 2146. [Google Scholar] [CrossRef] [PubMed]
  40. Thakur, A.; Gupta, M.; Sinha, D.K.; Mishra, K.K.; Venkatesan, V.K.; Guluwadi, S. Transformative breast Cancer diagnosis using CNNs with optimized ReduceLROnPlateau and Early stopping Enhancements. Int. J. Comput. Intell. Syst. 2024, 17, 14. [Google Scholar]
  41. Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:161005492. [Google Scholar]
  42. Luping, W.; Wei, W.; Bo, L. CMFL: Mitigating communication overhead for federated learning. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 954–964. [Google Scholar]
  43. Zheng, S.; Shen, C.; Chen, X. Design and analysis of uplink and downlink communications for federated learning. IEEE J. Sel. Areas Commun. 2020, 39, 2150–2167. [Google Scholar] [CrossRef]
  44. Ehrhart, J.; Ehrhart, M.; Muzet, A.; Schieber, J.; Naitoh, P. K-complexes and sleep spindles before transient activation during sleep. Sleep 1981, 4, 400–407. [Google Scholar] [CrossRef] [PubMed]
  45. Shahrbabaki, S.S. Sleep Arousal and Cardiovascular Dynamics. Ph.D. Thesis, The University of Adelaide, Adelaide, Australia, 2020. [Google Scholar]
  46. Perslev, M.; Darkner, S.; Kempfner, L.; Nikolic, M.; Jennum, P.J.; Igel, C. U-Sleep: Resilient high-frequency sleep staging. NPJ Digit. Med. 2021, 4, 72. [Google Scholar] [CrossRef] [PubMed]
  47. Phan, H.; Andreotti, F.; Cooray, N.; Chén, O.Y.; De Vos, M. SeqSleepNet: End-to-end hierarchical recurrent neural network for sequence-to-sequence automatic sleep staging. IEEE Trans. Neural Syst. Rehabil. Eng. 2019, 27, 400–410. [Google Scholar] [CrossRef] [PubMed]
  48. Phan, H.; Chén, O.Y.; Tran, M.C.; Koch, P.; Mertins, A.; De Vos, M. XSleepNet: Multi-view sequential model for automatic sleep staging. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5903–5915. [Google Scholar] [CrossRef] [PubMed]
  49. Xie, J.; Fonseca, P.; van Dijk, J.; Overeem, S.; Long, X. Assessment of obstructive sleep apnea severity using audio-based snoring features. Biomed. Signal Process. Control 2023, 86, 104942. [Google Scholar] [CrossRef]
  50. Zarei, A.; Asl, B.M. Performance evaluation of the spectral autocorrelation function and autoregressive models for automated sleep apnea detection using single-lead ECG signal. Comput. Methods Programs Biomed. 2020, 195, 105626. [Google Scholar] [CrossRef] [PubMed]
  51. Olsen, M.; Mignot, E.; Jennum, P.J.; Sorensen, H.B.D. Robust, ECG-based detection of Sleep-disordered breathing in large population-based cohorts. Sleep 2020, 43, zsz276. [Google Scholar] [CrossRef] [PubMed]
  52. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
53. Yuan, X.; Li, P. On convergence of FedProx: Local dissimilarity invariant bounds, non-smoothness and beyond. Adv. Neural Inf. Process. Syst. 2022, 35, 10752–10765. [Google Scholar]
54. Jin, C.; Chen, X.; Gu, Y.; Li, Q. FedDyn: A dynamic and efficient federated distillation approach on recommender system. In Proceedings of the 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS), Nanjing, China, 10–12 January 2023; pp. 786–793. [Google Scholar]
Figure 1. The overall architecture of the FMTL framework.
Figure 2. Detailed architecture of the shared CNN–GRU–attention encoder and task-specific heads used in the FMTL model.
Figure 3. Population screening of the APPLES dataset.
Figure 4. Sleep stage confusion matrices of the three clients (APPLES, SHHS_rest, and HMC).
Figure 5. SDB severity classification confusion matrices of the three clients (APPLES, SHHS_rest, and HMC).
Figure 6. Comparison of different common classification strategies. Transparent colors represent strategies without privacy-preserving capability. (a) Single. (b) Mix. (c) FedAvg.
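To relate the three strategies in Figure 6 to an implementation: "Single" trains each client only on its own data, "Mix" pools raw data centrally (no privacy), and FedAvg aggregates model weights without sharing raw recordings. The following is a minimal sketch of the server-side FedAvg aggregation step, assuming PyTorch-style state dictionaries; the function and variable names are illustrative and not taken from the authors' code.

```python
# Sketch of server-side FedAvg aggregation (strategy (c) in Figure 6).
# Assumes PyTorch state_dicts; "Single" would skip this step entirely,
# while "Mix" would instead pool the raw data on one server.
import copy
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Average client weights, weighted by each client's local sample count."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Toy usage: three clients (e.g., APPLES, SHHS_rest, HMC) with unequal data sizes.
clients = [{"w": torch.ones(2) * i} for i in (1.0, 2.0, 3.0)]
print(fedavg_aggregate(clients, client_sizes=[382, 329, 151]))
```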
Table 1. Sleep stage details of the three datasets used in our experiments.

| Dataset | Subjects | Sampling Rate | Epochs (W/N1/N2/N3/R) | OSA Severity Segments (Normal/Mild/Moderate/Severe) |
|---|---|---|---|---|
| APPLES | 382 | 100 Hz | 110,443/15,670/388,581/191,345/67,642 | 224/66/74/18 |
| SHHS | 329 | 125 Hz | 46,319/10,304/142,125/60,153/65,953 | 180/62/54/33 |
| SHHS | 5463 | 125 Hz | 445,627/61,898/665,508/222,570/241,922 | 812/1146/2473/1824 |
| HMC | 151 | 256 Hz | 23,315/15,441/49,950/26,640/21,191 | 55/33/27/36 |

Note that, in the APPLES dataset, we only included active patients, for whom the sampling rate is 100 Hz. SHHS_pretrain (SHHS dataset, N = 329) was used for pretraining only. W, N1, N2, N3, and R indicate the number of 30 s epochs for each sleep stage in each dataset.
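The four OSA severity categories in Table 1 are derived from the apnea–hypopnea index. A minimal sketch of this mapping, assuming the standard clinical cut-offs of 5, 15, and 30 events/hour (the exact thresholds used by the authors are an assumption here):

```python
# Sketch only: mapping AHI (events/hour) to the four severity classes of Table 1,
# assuming the standard 5/15/30 cut-offs rather than thresholds stated by the authors.
def osa_severity(ahi: float) -> str:
    if ahi < 5:
        return "Normal"
    if ahi < 15:
        return "Mild"
    if ahi < 30:
        return "Moderate"
    return "Severe"

# Example: a segment with AHI = 22 events/hour would be labeled "Moderate".
assert osa_severity(22) == "Moderate"
```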
Table 2. Training hyperparameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| Learning rate | 1 × 10⁻³ | Dropout rate | 0.3 |
| Optimizer | Adam | Weight decay | 1 × 10⁻⁴ |
| Batch size | 32 | Task loss weights (λ1, λ2) | 0.5, 0.5 |
| Local epochs (E) | 5 | Training time limit τ_k | Uniform [60, 180] seconds |
| Communication rounds (T) | 100 | Bandwidth b_k | Uniform [50, 500] KB/s |
| Client participation (C) | 0.2 × K | Dropout probability p_drop | 0.1 |
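As a worked sketch, the Table 2 settings can be collected into a single configuration object. The snippet below assumes PyTorch for the optimizer; `make_optimizer` and `joint_loss` are hypothetical helper names, and the joint objective L = λ1·L_stage + λ2·L_severity follows the task loss weights listed in the table.

```python
# Sketch only: Table 2 hyperparameters gathered into one config.
# Assumes PyTorch; helper names are hypothetical, not from the authors' code.
import torch

CFG = {
    "lr": 1e-3,                       # learning rate
    "weight_decay": 1e-4,
    "dropout": 0.3,
    "batch_size": 32,
    "task_loss_weights": (0.5, 0.5),  # (λ1 sleep staging, λ2 SDB severity)
    "local_epochs": 5,                # E
    "rounds": 100,                    # T, communication rounds
    "participation": 0.2,             # C, fraction of the K clients sampled per round
}

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with weight decay, as listed in Table 2.
    return torch.optim.Adam(model.parameters(),
                            lr=CFG["lr"], weight_decay=CFG["weight_decay"])

def joint_loss(loss_stage: torch.Tensor, loss_severity: torch.Tensor) -> torch.Tensor:
    # Multi-task objective: L = λ1 * L_stage + λ2 * L_severity.
    l1, l2 = CFG["task_loss_weights"]
    return l1 * loss_stage + l2 * loss_severity
```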
Table 3. The performance of the three clients on the sleep stage classification task using the FMTL framework.

| Dataset | W (%) | N1 (%) | N2 (%) | N3 (%) | REM (%) | Recall (%) | Specificity (%) | Acc (%) | MF1 (%) | MGm (%) | κ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| APPLES | 88.3 | 53.7 | 93.5 | 70.6 | 77.3 | 74.7 | 96.1 | 85.3 | 76.7 | 83.7 | 0.71 |
| SHHS_rest | 88.1 | 57.8 | 89.5 | 85.6 | 86.4 | 82.5 | 96.4 | 87.1 | 81.5 | 89.0 | 0.86 |
| HMC | 81.9 | 61.4 | 82.4 | 80.3 | 79.1 | 76.7 | 94.6 | 79.3 | 77.0 | 85.0 | 0.72 |

The W–REM columns are per-class F1 scores; the remaining columns are overall metrics.
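The overall metrics in Tables 3–5 (macro recall, specificity, accuracy, MF1, MGm, and Cohen's κ) can all be derived from a confusion matrix. Below is a sketch assuming scikit-learn, where MGm is taken to be the macro-averaged geometric mean of per-class sensitivity and specificity; this definition is an assumption, as the paper's exact formula is not restated here.

```python
# Sketch of the overall metrics reported in Tables 3-5, assuming scikit-learn.
# MGm is computed as the macro-averaged geometric mean of per-class
# sensitivity and specificity -- an assumption about the authors' definition.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix, f1_score

def overall_metrics(y_true, y_pred, n_classes=5):
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp           # missed samples of each class
    fp = cm.sum(axis=0) - tp           # false alarms for each class
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)       # per-class recall
    specificity = tn / (tn + fp)
    return {
        "Recall": 100 * sensitivity.mean(),
        "Specificity": 100 * specificity.mean(),
        "Acc": 100 * accuracy_score(y_true, y_pred),
        "MF1": 100 * f1_score(y_true, y_pred, average="macro"),
        "MGm": 100 * np.sqrt(sensitivity * specificity).mean(),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }
```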
Table 4. The performance of the three clients on the SDB severity classification task using the FMTL framework.

| Dataset | Normal | Mild | Moderate | Severe | Recall (%) | Specificity (%) | Acc (%) | MF1 (%) | MAUC (%) |
|---|---|---|---|---|---|---|---|---|---|
| APPLES | 85.2 | 71.1 | 90.3 | 82.8 | 73.4 | 91.3 | 79.2 | 74.6 | 82.4 |
| SHHS | 86.7 | 82.2 | 90.6 | 94.9 | 82.7 | 94.5 | 84.5 | 83.8 | 88.6 |
| HMC | 92.1 | 82.9 | 89.5 | 95.4 | 84.8 | 95.0 | 85.4 | 85.0 | 89.9 |

The Normal–Severe columns are per-class macro-AUC values; the remaining columns are overall metrics.
Table 5. Comparison of training strategies (Single, Mix, and FMTL) on the sleep staging and OSA severity classification tasks for the three clients.

| Client (Dataset) | Strategy | Recall | Specificity | Acc | MF1 | MGm | κ | Recall | Specificity | Acc | MF1 | MAUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| APPLES | Single | 66.8 | 80.2 | 78.2 | 69.1 | 79.9 | 0.68 | 67.5 | 82.1 | 70.0 | 69.2 | 78.5 |
| APPLES | Mix | 71.0 | 85.3 | 83.5 | 72.2 | 81.6 | 0.69 | 69.4 | 88.0 | 74.2 | 70.8 | 79.9 |
| APPLES | FMTL | 74.7 | 96.1 | 85.3 | 76.7 | 83.7 | 0.71 | 73.4 | 91.3 | 79.2 | 74.6 | 82.4 |
| SHHS | Single | 76.4 | 94.5 | 78.9 | 75.2 | 84.3 | 0.78 | 77.2 | 81.9 | 78.4 | 77.2 | 80.6 |
| SHHS | Mix | 81.4 | 95.7 | 85.1 | 79.4 | 88.2 | 0.82 | 82.1 | 94.4 | 83.9 | 83.5 | 85.1 |
| SHHS | FMTL | 82.5 | 96.4 | 87.1 | 81.5 | 89.0 | 0.86 | 82.7 | 94.5 | 84.5 | 83.8 | 88.6 |
| HMC | Single | 68.2 | 84.1 | 72.5 | 73.9 | 80.6 | 0.69 | 79.4 | 83.9 | 82.5 | 83.8 | 85.1 |
| HMC | Mix | 73.1 | 90.5 | 77.4 | 73.2 | 82.1 | 0.70 | 80.0 | 88.2 | 83.1 | 83.9 | 86.7 |
| HMC | FMTL | 76.7 | 94.6 | 79.3 | 77.0 | 85.0 | 0.72 | 84.8 | 95.0 | 85.4 | 85.0 | 89.9 |

The first six metric columns (Recall–κ) refer to sleep staging (%); the last five (Recall–MAUC) refer to OSA severity (%).
Table 6. Summary of literature for comparison.

| Authors/Network Name | Method | Input | Output | Dataset | Results |
|---|---|---|---|---|---|
| U-Sleep [46] | CNN model | Majority vote | Sleep stages | SHHS | MF1 score 80.0% |
| SeqSleepNet [47] | CNN model and BiLSTM | C4-A1, EOG, EMG | Sleep stages | SHHS | MF1 score 78.5%, κ 0.81 |
| SSC-SleepNet [17] | CNN and ResNet | EEG | Sleep stages | SHHS | MF1 score 84.0%, κ 0.86 |
| XSleepNet1 [48] | CNN and RNN model | C4-A1, EOG, EMG | Sleep stages | SHHS | MF1 score 80.7%, κ 0.83 |
| Xie et al. [49] | Extreme gradient boosting classifier | Demographic features and features from overnight snore patterns | AHI estimation | Full-night audio signals from 172 subjects, cross-validation | Spearman's correlation = 0.786 |
| Zarei and Asl [50] | Random forest classifier | Autoregressive-modeling and spectral-autocorrelation features from ECG | Segment-based (60 s) classification | ECG from 70 subjects (Apnea-ECG database), cross-validation | Accuracy = 0.94, sensitivity = 0.92, specificity = 0.95 |
| Olsen et al. [51] | RNN model | IBI and EDR | Event-based detection | 9869 recordings from different datasets, 1051 for testing | Sensitivity = 0.709, specificity = 0.734, F1 = 0.721 |

Note: For the majority vote in U-Sleep, the hypnograms were generated using predictions from all available EEG–EOG channel combinations within each record.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
