1. Introduction
Depression is the main cause of disability around the world, affecting 280 million people annually. Standard treatments for depression still leave too many patients without hope, with up to 60% of people having a second episode after their first, and 80% of people who have had two or more episodes go on to relapse [
1]. It has been estimated that about 50% of patients with one episode of major depressive disorder will have a second one and that the frequency of recurrence increases much more than in previous episodes [
2]. According to studies [
3,
4], the probability of a new relapse after the first is 70%, and, after the third episode, the probability increases to 90%. Relapse in depression has different causes, including incomplete recovery, residual symptoms, stress, and biological vulnerabilities. Prediction of relapse at an early stage enables the alleviation of the effects of repetitive episodes on patient quality of life. Several studies have used an electroencephalogram (EEG) for the detection of depression because it is an inexpensive, non-invasive, and objective technique to identify patterns of brain activity related to depression [
5,
6]. It is beneficial for mild and moderate to severe major depressive disorder. The ability of this modality to be used in real time allows therapists to track current neural signals and predict early signs of relapse therapy.
Depression is a prevalent and serious mental illness that causes a negative impact on mood, cognition, and daily functioning. It is also a neurologically manifested disorder that is characterized by altered activity in neurochemical, neurophysiological, and electrophysiological domains. As such, biomedical signal analysis has become an active field of research dedicated to understanding depression. In the clinical study of depression, two related tasks are commonly of interest: the detection and prediction of depressive relapse. In this context, detection is typically the problem of determining whether relapse-related neural or behavioral signal patterns are present at the time of data acquisition, and prediction is the problem of estimating the probability of a future relapse episode. These two tasks may also be distinguished when considering EEG data specifically, such that the real-time detection of abnormal brainwave activity (detection) may also be used to supplement active relapse risk forecasting (prediction). In this study, we adopt these specific definitions to make the distinction between detection and prediction tasks, for the purpose of maintaining clarity and consistency, while introducing methods for EEG-based biomarker detection and relapse prediction [
7,
8]
Major depressive disorder is a phenotype of interacting neurochemical and circuit-level changes that modulate the cortical postsynaptic potentials (PSPs) composing the scalp EEG. These include, in addition to monoaminergic (serotonin, noradrenaline, dopamine) alterations, converging evidence for excitatory–inhibitory imbalance (increased glutamatergic tone and decreased GABAergic inhibition), HPA dysregulation and its downstream cortisol effects on prefrontal–limbic circuits, neuroinflammatory signaling and the downregulation of neurotrophic support, etc. These mechanisms reconfigure the underlying microcircuit dynamics, which are manifested in EEG as, e.g., frontal alpha asymmetry, increases in frontal beta power, aberrant midline/theta activity, sleep EEG anomalies, and dampened event-related potentials (e.g., reduced P300), with directionality and effect sizes modulated by region, subtype, comorbidity, and other factors. Clinically, in addition to mood, depression has consequences for attention/executive control/memory, and, in late-life depression, these functional deficits, along with a vascular/inflammatory burden, are considered causes of increased risk for dementia. Structural and functional MRI have been applied in major depressive disorder to a much greater degree than electrophysiological signatures, which are more directly linked to synaptic dynamics. The latter remain comparatively underutilized to predict the risk of relapse; motivated by this, we utilize resting-state EEG to learn interpretable, physiologically anchored biomarkers of relapse risk using a privacy-preserving, federated approach [
9].
Hemispheric lateralization is region- and task-selective, rather than global. Handedness indexes motor cortex asymmetry but does not define overall ‘left’ or ‘right brain’ dominance. Cortical systems can be asymmetric in different directions; for instance, language networks are left-lateralized, visuospatial attention is often right-lateralized, and there are mixed left–right patterns of asymmetry in prefrontal regions associated with affective control. In line with these views, we interpret EEG asymmetry features only within the cortical regions from which they originate (e.g., frontal alpha asymmetry as a putative marker of approach–withdrawal balance) and we do not use handedness to infer overall brain dominance. Handedness is included in the statistical models as a covariate and we report a sensitivity note verifying that inferences drawn based on region-specific EEG features do not depend on handedness.
Recent advances in the prediction of depression relapse have been facilitated by a variety of techniques and are closely related to machine learning and neuroimaging. A deep learning-based approach was used in [
7] to solve two problems related to depressive disorders, namely the recognition of depression and prediction of relapse. The model utilizes a ’model of normality’ concept, which implies measuring the proximity of audiovisual patterns during nondepressed episodes to those of listless patients, and it has been tested in a notable DAIC-Woz database. It is considered a flexible and scalable solution for tracking depression and predicting its relapse. In a recent article [
8], a new approach to diagnosing depression and predicting a relapse of mental illness is discussed based on videos obtained during clinical interviews. The method is used to determine the degree of similarity in the audiovisual patterns of the subject to those of an already depressed patient, which is typical for video material coming from a subject in clinical practice. As a result, with accuracy of more than 80% in the DAIC-Woz datasets, this represents an effort to predict depression without using user input.
Prior EEG work in depression reports frontal alpha asymmetry (greater right than left alpha), elevated frontal beta power, and altered midline/theta activity, although the exact direction and effect sizes vary by cortical region, clinical subtype, and comorbidity [
10]. Under acute stress, frontal-lobe activity—especially within prefrontal–cingulate circuits—can increase, but responses are heterogeneous across subregions and cell populations; accordingly, we interpret stress effects using region-specific EEG features rather than assuming uniform neuronal activation.
The key limitation in existing EEG-based relapse detection methods is the lack of explainability [
11]. In clinical settings, knowing why a model makes any specific prediction is important to gain trust from clinicians and to ensure that system recommendations are actionable. New posture data-oriented explainable artificial intelligence (XAI) technologies include layer-wise relevance propagation (LRP) and Shapley values, from which EEG features or brainwave patterns that contribute most to predicting depression rebound can be visually deduced. When XAI becomes part of the relapse detection framework, its clinical adoption can improve the clarity of doctors regarding the underlying neural mechanisms. In addition, there is a growing trend in applying XAI in medical diagnosis to realize better patient outcomes and to make AI models accountable and interpretable. A piece of recent work highlights the importance of bringing XAI to mental health applications, particularly for sensitive tasks such as relapse prediction, where transparent decision making can help clinicians to take appropriate and timely action.
In this study, we present a novel architecture combining personalized federated learning (PFL) and explainable AI (XAI) for the detection of depression relapse based on EEG data. Under the proposed framework, each patient’s model is customized to his or her unique brainwave activity, and the patient’s information is kept personal. Through meta networks and local fine-tuning, the global model picks up the general tendency within multipatient data, whereas each local model tailors its prediction to the specifics of individual data. In addition, XAI techniques provide clarity and understanding, with the result that medical professionals can immediately begin working with these new insights. This twin emphasis on personalization and explainability marks a significant advance in EEG-based depression relapse detection above and beyond the latest findings in personalized medicine and understandable models. Furthermore, incorporating FL into actual clinical practice shows great potential to push patient-centered care another step forward. It ensures that our models always adapts to the evolving needs of patients, and privacy is still maintained at the highest level.
Contributions
There are three novel contributions to the research on explainable AI for the personalized detection of depression relapse using electroencephalogram (EEG) data.
Electroencephalogram-Based Depression Relapse Prediction: EEG has been thoroughly investigated for the detection and diagnosis of depression, but no study has been conducted for depression relapse. Few studies have attempted to develop models capable of predicting the time to relapse in depression using audiovisual characteristics [
7]. To our knowledge, the prediction of depression relapse using EEG does not yet exist in the literature. Only one article exists in the literature, which was written by one of our authors [
5]. The construction of EEG-based relapse prediction models will provide a great opportunity for early intervention and custom treatment plans.
Integration of Personalized Federated Learning (PFL): The research describes the approach to integrate PFL, where the EEG data of each patient can be kept private. In this way, the data cannot be misused, and precalculated properties can be sent to clients without sharing confidential raw data. This allows for customization to the individual patient, called meta-learning and fine-tuning locally, but it benefits from the global model’s general knowledge, which improves both the effectiveness and privacy in relapse detection for depression.
Layer-Wise Relevance Propagation (LRP) and Shapley Values for Explainability: Crucial to this is the use of explainable AI features such as layer-wise relevance propagation (LRP), contributing to the interpretability of data. Millions of EEG segments in turn contribute to model predictions, and Shapley values clarify electroencephalography (EEG) results, enabling an understanding of how certain EEG features affect our models’ decisions. Due to this interpretability, executives are more likely to build trust in AI-fueled decisions, which is necessary in healthcare to ensure that clinicians are able to interpret and act on predictions.
Adaptive Wavelet Entropy Filtering (AWEF): In this study, we propose a new preprocessing method called adaptive wavelet entropy filtering (AWEF) to improve the prediction of depression relapse based on EEG. Combined with a privacy-preserving federated learning framework, it reduces noise from EEG signals while maintaining the essential characteristics of individual signals, thus improving model interpretation and performance. This strategy guarantees accurate and consistent predictions that address the challenges of noise and variability as the use of decentralized mental health diagnostics increases.
Better Predictive Performance for Depression Relapse Detection: The proposed model shows improved performance on EEG datasets, with precision of up to 92% and an F1-score of 90.5%. These evaluation results indicate that our proposed model could have a superior ability to highly accurately predict depression relapses, as well as to provide more trustworthy, customized, and explainable predictions than existing centralized models. This helps in early and accurate interventions within mental health care.
To our knowledge, there are no EEG-based depression relapse prediction approaches in the literature. Therefore, these contributions underscore the distinctive methodological principles by which privacy-aware personalization and explainability can be targeted for mental health AI solutions.
The paper consists of the following sections. In the Introduction, the relevance of depression relapse identification and the novel PFL-XAI model are discussed. Related Work identifies gaps in existing research with respect to personalization, privacy, and explainability. The Methodology section describes the proposed model, including the personalized federated learning and XAI approaches, while the Results and Discussion section analyzes the model’s results in terms of its precision, F1-score, and explainability. Finally, the Conclusions summarize the main findings and consider the practical implications of the developed method for clinical settings.
2. Related Work
In the past few years, several machine learning and deep learning models have been developed to detect depression using EEG data. Several other modalities (MRI, fMRI, MEG, MRS) have also been used to assess neurochemical and structural changes related to relapse, but they are often expensive, less portable, and do not provide the high temporal resolution needed to capture more rapid electrophysiological changes. In contrast, EEG is low-cost, non-invasive, and widely accessible and offers a millisecond-level temporal resolution. This makes EEG a particularly pragmatic modality for use in clinical relapse detection and prediction. In line with the reviewed literature (
Table 1), current AI methods for depression relapse analysis can be conceptually classified into three general categories: (i) machine learning approaches that rely on hand-crafted EEG features (e.g., SVM, random forest), (ii) deep learning approaches that utilize hierarchical representation learning (e.g., CNN, LSTM, hybrid models), and (iii) generative AI methods for improved data augmentation and representation learning. This classification offers a structured summary of the field without being overly redundant in narrative.
In [
18], an evidence-based mega-analysis, with connectivity abnormalities in youth depression, used the NeuroDatabase site-aggregated neuroimaging data and performed statistical and machine learning analyses on brain images to separate and identify depressed and healthy groups. In recent years, a large number of studies have been conducted to detect depression, primarily using EEG. Some of these techniques include MFCC features with CNNs [
19], an LSTM–Attention model [
20,
21,
22], hybrid EEG–NIRS with SVMs [
23], and adaptive dynamic convolution [
24]. More recent and complex methods include MAST-GCN [
25], deep neural networks [
26], neurofuzzy methods [
27], and LSDD-EEGNet [
28]. In addition to EEG, research on other approaches, like mindfulness-based therapy [
29,
30], maintenance antidepressant treatment [
31] , and wearable activity/sleep monitoring [
32], has been applied to depression relapse prevention. A systematic review article also underlined artificial intelligence applications for predicting relapse and suicidality in bipolar disorder [
33].
In this research, we sought to overcome some of the limitations of these works by using a novel neural architecture and datasets that have not been previously used for post hoc analyses with appropriate performance evaluation metrics. In addition, we focused on women and children as the population for the current study and also used the DSM-5 criteria. A 2.6-billion-word corpus for extended text generation analysis was also a part of our study.
Table 2 presents a comparison of studies (2019–2024) on EEG-based depression detection based on personalization, privacy, explainability, temporal/spatial pattern recognition [
34,
35,
36], scalability, and the type of model. From the results, it is obvious that CNNs, LSTM, and reinforcement learning perform the best in temporal/spatial detection but with a lack of personalization, privacy, and explainability.
Table 3 shows a demonstration of the challenges faced by the proposed PFL-XAI model in EEG-based depression detection, contrary to previous solutions. It involves various aspects, such as data privacy, explainability, personalization, scalability, heterogeneous EEG data, and real-time application, as discussed. Previous centralized models are at risk of compromising data privacy [
11,
15,
17], while the PFL approach used in the current model keeps data local and is executed in collaboration without compromising quality. This model further integrates XAI, as opposed to existing XAI and black-box models that limit the explainability of XAI features [
12,
13]. Similarly, the already generalized methods were not tailored to specific patient needs, limiting personalization [
10,
14,
16]. Furthermore, previous solutions came with high computational costs that made it impossible to use them for real-time applications, although they used high-performance techniques and reinforcement learning [
15,
17,
21]. Furthermore, only the current model is designed to address the heterogeneous sources of EEG data involved. Unlike the previous CNN and LSTM models [
10,
12,
16], the current solution presents a complete data source comparison with improved accuracy.
Table 3 provides a concise summary of recent work on the topic of EEG and related modalities for the analysis of depression relapse. Interpretability and robustness have been improved by a variety of methods, including spectral analysis, concept-based XAI, and attention-guided saliency maps. Despite some notable exceptions, most works are based on limited sample sizes, with a lack of relapse follow-up and/or clinical validation and generalizability.
3. Dataset
In our research, we utilized a public, de-identified dataset from the Healthy Brain Network (HBN). N = 100 individuals who had complete EEG and clinical information for modeling in the current analyses were included (49 females, 51 males; age at EEG acquisition, 13.7 ± 3.2 years, range 8–20). Although having multiple sources of EEG data often implies different acquisition protocols (sampling frequency, number of electrodes, referencing, and preprocessing pipelines), the data used in the current study were acquired through the HBN, where the same standardized protocol was followed to make the data comparable across all participants. Moreover, the suggested preprocessing pipeline (ICA-based artifact correction and adaptive wavelet entropy filtering) was chosen specifically to limit any protocol-specific noise and variations. The architecture of the proposed model was designed to be easily scalable and can be adapted for extension to multi-site/heterogeneous datasets with different acquisition protocols, pointing out the robustness and scalability of the proposed model for real-world clinical use. EEG was collected as part of the HBN protocol, using a 128-channel EGI HydroCel system and sampled at 500 Hz with impedances maintained <50 k. Signals were referenced to Cz during acquisition and re-referenced to the average offline. Recordings were obtained in a quiet, dimly lit room with participants seated. The protocol included resting-state blocks with eyes open with central fixation and eyes closed, approximately 5 min each; our analyses use these resting-state segments only. Preprocessing included band-pass filtering 0.5–40 Hz bad-channel detection/interpolation, ICA-based artifact attenuation, epoching to task/rest events, baseline correction, and objective trial rejection; only trials meeting quality approval were modeled. Depression diagnoses were taken verbatim from clinician-administered KSADS-C OMP (DSM-5) diagnostic interviews; if present, participants were coded as either major depressive disorder (single or recurrent) or persistent depressive disorder, and the relapse risk label was based on KSADS-COMP diagnostic flags combined with symptom severity on the MFQ-Child/Parent and CBCL DSM-Depressed using standardized cutoffs. Psychotropic medication use at/near the time of EEG (antidepressant, stimulant, antipsychotic, anxiolytic, none) and DSM-5 comorbidities (e.g., ADHD, anxiety, ASD) were abstracted from HBN metadata. The HBN does not systematically collect information on therapy between episodes; therefore, this is reported when available and otherwise noted as N/A. This is a secondary analysis of public, de-identified data; no new human subjects were enrolled.
4. Materials and Methods
The general pipeline of this method comprises five sequential steps: (i) EEG data retrieval from the HBN; (ii) preprocessing (filtering, normalization, and augmentation); (iii) feature extraction using deep neural models; (iv) training and hyperparameter search; and (v) evaluation using several performance metrics. The entire pipeline is conducted in a transparent, reproducible, and clinically consistent manner.
Figure 1 explains the prediction of depression relapse using PFL and explainable AI. In the first step, brain signals are acquired using a 128-channel EEG HydroCel Geodesic system, a reliable tool for noninvasive data acquisition. After signal acquisition, the second step is the preprocessing of EEG signals by applying adaptive wavelet entropy filtering (AWEF), with a sampling rate of 500 Hz. The third step of the proposed approach is to naively extract useful features in the time domain and the frequency domain from processed EEG signals. The features are also used in the classification of depressed and relapsed subjects. The proposed approach is novel; to our knowledge, there is no research in the literature on the impact of depression on learning, memory ability, resting state, processing speed/capacity to complete tasks, audiovisual stimulation, reaction time, motor preparation, or excitatory and inhibitory cerebral activity via active and passive EEG paradigms.
Preprocessing
EEG recordings were acquired using a 128-channel HydroCel Geodesic Sensor Net at a sampling rate of 500 Hz. The online bandpass filter of 0.5–45 Hz and anti-aliasing were automatically applied by the acquisition software. Z-score normalization across all channels was used to stabilize the input distribution. Class balancing and signal diversity were augmented through both simple and advanced approaches. Simple augmentations included filtering, brightness scaling, and more. Advanced methods included GAN- and VAE-based signal augmentations. Data annotation was performed using EEGLAB, version 2024.1, and MNE-Python, version 1.10.1. Two independent experts labeled the artifacts and clinical events, resulting in inter-rater agreement of 0.85. Psychometric data were drawn from the same HBN admission as the EEG and included the KSADS-COMP (DSM-5) clinician interview for diagnosis, as well as the MFQ-Child and MFQ-Parent total scores (33 items each; 0–2 per item; 0–66 range) and the CBCL DSM-Depressed T-score (age/sex-normed; mean 50, SD 10). Diagnosis labels were derived only from KSADS-COMP; the MFQ and CBCL entered the model as continuous auxiliary covariates. For the learning step, MFQ/CBCL scores were standardized (z-score), winsorized (1st/99th), and concatenated to the EEG feature vector prior to classification, all of which was applied to the training data only. Item-level missingness was prorated if ≤10%; otherwise, the entire scale was imputed as missing. Whole-scale missing values were then imputed by the training-fold mean with a paired missingness indicator to preserve information content without allowing data leakage.
The computational framework in this study was constructed to efficiently facilitate the computational work involved in processing EEG data and developing a complex ML model. The hardware setup was composed of an AMD Ryzen 5 5600X CPU, an NVIDIA GTX 1660 Super 6 GB VRAM GPU, 16 GB DDR4 RAM, and a 512 GB SSD. This configuration was able to perform preprocessing and train and evaluate a model quickly enough without the need to have an extremely high-end machine. The Vision Transformer (ViT) was used on EEG spectrogram representations with the primary purpose of extracting spatial features (e.g., frequency–channel interactions). Long short-term memory (LSTM) layers were added to also capture temporal dependencies in the signal. The resulting ViT-LSTM architecture is such that the spatial activation patterns and temporal dynamics are learned jointly for prediction.
A new approach to preprocessing was proposed in order to remove noise in EEG signals to contribute to building an effective model. The preprocessing pipeline began with adaptive wavelet entropy filtering (AWEF), which integrates wavelet packet decomposition followed by feature selection using entropy. This aids in decomposing EEG signals into multiple frequency sub-bands and employing entropy thresholds to segregate noise and irrelevant components while preserving the most informative features, thus enabling cleaner input for model training with less computational complexity. Furthermore, artifact removal was performed using a hybrid entropy-based method to obtain a clean signal, free of noise resulting from blinks of the eyes, muscle activity, and various artifacts in the signal. The signals were normalized to be in the range [0, 1], and data augmentation techniques, such as time warping, brightness adjustment, and horizontal flipping, were performed to increase the diversity of the data. The machine learning framework was implemented with Python 3.8, with TensorFlow 2.8 for modeling. To learn the spatial and temporal features of the EEG signals, we adopted the Vision Transformer (ViT) architecture. We used the aforementioned method to perform training for our model using the Adam optimizer with a learning rate of 0.001; the loss function was binary cross-entropy and the batch size was 16. Training was carried out in an 80–20 stratified train–test split for 20 epochs. Models were trained using the Adam optimizer with a learning rate of 0.001, binary cross-entropy loss, and 20 epochs. The values for these hyperparameters were not chosen randomly but determined after grid searching through a variety of learning rates (0.0001–0.01), batch sizes (16–128), and numbers of epochs (10–50). The selected configuration allowed for convergence with minimal overfitting. To protect patients’ data privacy, training was performed using a federated learning method, in which a global model shared across multiple hospitals was jointly fine-tuned to provide predictions for individual patients. The focus of the study was explainability. We used Grad-CAM to visualize the EEG characteristics that drove model decisions and, ultimately, to identify neurophysiological patterns related to predictions. The model’s effectiveness was evaluated using performance evaluation metrics such as accuracy, precision, recall, the F1-score, and the AUC-ROC. An analysis of published studies shows high potential for the proposed approach, leveraging biological inferences to improve transparency in the processing of electroencephalogram (EEG) data, as well as strong classification performance based on qualitative findings derived from EEG in clinical practice; it thus maximizes interpretability when compared to other traditional methods.
5. Research Methodology
Despite remarkable innovations in deploying AI to identify depression, existing models all suffer from critical shortfalls: none address the problem of predicting relapse. Many models overlook the individual EEG patterns of patients, using instead a single model encompassing all individuals. This lack of personalization leads to failures, as seen when a standard model was applied to groups with differing brainwave patterns. Furthermore, the centralized nature of many machine learning systems creates data privacy issues. Particularly in medical contexts, regulations such as GDPR require absolute control over personal information. Adding to the above, there is an important problem with many machine learning models—their lack of interpretability. Doctors need to know clearly and in detail why a model forecasts a relapse in order for it to be functionally trustworthy within a hospital, but most of these models work as black boxes, where one does not see the decision-making process.
To overcome these limitations, this study proposes a new framework that combines personalized federated learning (PFL) with explainable artificial intelligence (XAI) to improve the diagnosis of depressive relapses using EEG data. For preprocessing, first, EEG data are collected from 100 individuals. Given the noisy nature of raw EEG signals, the preprocessing of the EEG data, with the aim of making the signal clean and accurate for relapse judgment, is performed in a detailed manner. After eliminating artifacts such as eye blinks and muscle movements through independent component analysis (ICA), adaptive wavelet entropy filtering (AWEF) extracts important frequency bands from EEG signals—especially those associated with depression (like alpha and beta waves). Advanced adaptive wavelet entropy filtering (AWEF) separates the complex nonstationary EEG signals into a number of intrinsic mode functions (IMFs), which capture subtle neural patterns that are characteristic of depression. Furthermore, a canonical correlation analysis (CCA) is performed to verify the correlations between these intrinsic brainwave features and clinical variables.
Equation (
1) represents the adaptive wavelet entropy filtering (AWEF) technique, where
is the wavelet decomposition of the EEG signal
at level
j,
is the adaptive weight, and
computes the entropy of the wavelet coefficients.
Once the data have been preprocessed, every patient’s EEG is used to train a model that can be personalized locally to any given institution or device using Picis (linearized federated learning).
Equation (
2) presents a personalized loss function for patient
i, where
denotes the local model parameters,
denotes the global model parameters, and
is the regularization parameter to balance personalization and global knowledge.
Equation (
3) updates the global model using federated averaging with a personalization factor
, ensuring that the global model retains individual patient-specific knowledge.
Equation (
4) computes the relevance score for feature
j in layer-wise relevance propagation (LRP), where
is the weight that connects feature
j to neuron
k, and
is a small constant for numerical stability.
Equation (
5) calculates the Shapley value for the characteristic
j, quantifying its contribution to the prediction of the model considering all possible subsets of characteristics
S.
The model architecture includes CNNs, which extract spatial features from the EEG signals by region and identify how different brain regions interact. In addition, recurrent neural networks (RNNs) and specifically long-short-term memory (LSTM) networks detect patterns over time in the EEG signals. Thus, they can track changes in brainwave activity associated with depression relapse.
Equation (
6) combines temporal and spatial features from EEG signals, using a CNN for spatial features and LSTM for temporal features, where ⊕ denotes the concatenation of features.
Equation (
7) initializes the global model using meta-learning, where
is the meta-learning rate and
is the loss for patient
i. Through local training, the model becomes capable of learning each patient’s unique brainwave pattern, which in turn makes it possible to accurately predict depression relapse. This is essential work. After local models are trained, only model parameters such as weights and gradients are sent to the central server. In doing so, the privacy of particular patients is preserved, which means that raw data are not shared in any way. The central server aggregates local model updates with federated averaging, combining knowledge from all local models to produce a global model that is generalizable across the population.
Equation (
8) represents the empirical mode decomposition (EMD) of the EEG signal
, where
denotes the intrinsic mode functions, and
is the residual.
Equation (
9) computes the canonical correlation between EEG features
X and clinical variables
Y, where
is the cross-covariance matrix.
Equation (
10) fine-tunes the global model for patient
i, where
is the fine-tuning learning rate.
Equation (
11) normalizes the EEG signal
using its mean
and standard deviation
.
Equation (
12) applies time warping to the EEG signal
by introducing a small time shift
.
The global model benefits from the diversity of EEG data and is fed back to every patient’s device. At this point, each local device fine-tunes the global model with its patient-specific data. This ensures that the model retains broad knowledge while still being personalized to individual patient’s EEG patterns. Thus, the model is evaluated and validated locally and globally using key performance measures such as correctness, precision, recall, the F1-score, and the AUC. Furthermore, we measure how well XAI techniques explain the model, providing interpretability feedback for the model from clinical depression criteria known to experts. Shapley values produce clinically practical and helpful explanations that meet with approval from clinicians. This combination of personalized learning, privacy-preserving data processing, and interpretability addresses crucial deficiencies in current relapse detection systems for depression to create a more precise, transparent, and clinically effective solution.
Equation (
13) adjusts the brightness of the EEG signal
by adding a small constant
.
Equation (
14) flips the EEG signal
horizontally, where
T is the total duration of the signal.
Equation (
15) computes the Grad-CAM relevance map, where
denotes the activation maps, and
denotes the gradient weights.
Equation (
16) presents the cross-entropy loss for binary classification, where
and
are the true and predicted labels, respectively.
Equation (
17) computes the area under the curve (AUC), where TPR and FPR are the true positive and false positive rates, respectively.
Equation (
18) calculates Matthew’s correlation coefficient (MCC), a balanced metric for binary classification.
Equation (
19) computes the Kaplan–Meier survival probability, where
is the number of relapses at time
, and
is the number of patients at risk.
In
Table 4, we present a comparative analysis of centralized learning (CL), federated learning (FL), and personalized federated learning with explainable AI (PFL-XAI) for EEG-based depression detection. Centralized learning processes the data remotely. Centralization ensures privacy and learnability that is strictly one-to-one; however, it never achieves very high precision. Federated learning involves decentralizing the data but providing a minimal level of personalization or individual experience, at the cost of domain-specific knowledge. The PFL-XAI method builds on FL by integrating full personalization with independence for the patient and pays attention to interpretable ML, i.e., explainable AI (XAI), which results in a clinically interpretable model. Although the scalability is high in both FL and the suggested model, the PFL-XAI model minimizes communication to decrease the performance overhead. This makes the proposed model even more accurate (approximately 92% accuracy) than comparable CL methods (85–90%) and FL methods (82–88%), representing a state-of-the-art solution for the detection of depression based on individual, privacy-aware, and interpretable EEG.
Figure 2 presents the personalized federated privacy-preserving learning and explainable AI workflow to detect depression relapse using EEG. The process starts with the collection and acquisition of EEG data from each individual patient; then, it continues with the preparation section, where the raw data are cleaned, denoised, filtered, and resampled, so that they can be used to train personalized local models. After training, the locally trained models are averaged on the server side, and personalized models are trained on local patient data aggregated through federated averaging to produce a global model. Then, the model conditions are determined for model approval and the refinement of the local model upon rejection of the conditional model. When the global model is approved, a smaller fraction of the raw data is used to fine-tune the globalizing model. As the global model is trained with the raw data, the model is interpreted using an XAI technique like layer-wise relevance propagation or Shapley values. In the end, the output of the model is presented as the final product of the workflow. The result reflects a system that combines the personalized and federated learning approach with XAI for the purpose of detecting the possibility of depression relapse.
7. Results and Discussion
Our results demonstrate the performance of the proposed framework, which integrates personalized federated learning with explainable artificial intelligence, using EEG datasets for the detection of depression relapse. The performance evaluation of the model was performed on the basis of standard metrics. They include accuracy, precision, recall, the F1-score, and the area under the curve. In addition, a model explainability evaluation was carried out according to clinical feedback on information provided by layer-wise relevance propagation and Shapley values.
Figure 3 illustrates the performance comparison for a CNN-LSTM model trained on personalized data to predict EEG depression relapses. A CNN is used to extract high-level spatial features representing interactions between areas of the brain that are closely associated with relapse in depression (prefrontal cortex, limbic system, etc.). This spatial extraction allows the model to determine which regions of the brain are relevant for the prediction of relapse. In parallel, the LSTM learns temporal features such as abnormal beta or theta wave activity in relation to depressive states. Through this, the model is better able to find patient-specific patterns that more general models might miss by blending spatial and temporal analysis.
The features that our model uses are clinically interpretable: a higher risk score corresponds to higher right-greater-than-left frontal alpha (matching with approach–withdrawal imbalance) and increased frontal beta and midline/theta changes. This pattern has been found repeatedly in EEG studies of depressive symptoms; psychometric covariates (MFQ/CBCL) change in the same direction, supporting face validity. Limitations are the secondary use of the HBN (recordings in different contexts, not acquired directly for our purposes), incomplete data for between-episode therapy, and unmeasured confounding due to medications and comorbidities; we therefore report sensitivity analyses without medicated/comorbid individuals and stress subject-level splits to prevent leakage. Generalization is expected to be improved by multi-site calibration and age-stratified models, which we describe as future steps, along with prospective validation.
Figure 3 shows performance metrics such as accuracy, precision, recall, and the F1-score, which were plotted over multiple testing epochs to illustrate how the model improves while trained. The model has high local accuracy, from 92% to 95% for each subject; this demonstrates that the model learns from the different EEG characteristics of each user, which can be effective in relapse detection. With an F1-score of approximately 90%, we show the sensitivity and specificity (precision) of the balance that our model achieves, ensuring that it is a reliable method for clinicians to accurately predict relapses. The consistently increasing values of these metrics across the epochs demonstrate that the model learns quickly, can make rapid predictions, and improves its predictions as it becomes acquainted with each patient’s unique brainwave patterns, thereby leading to highly individualized and interpretable summaries.
As seen in
Figure 4, the accuracy of the federated global model improved as a result of fine-tuning. This model was built by collecting local models from various patient data via federated averaging and demonstrates poor performance. The dashed blue line in
Figure 4 represents the accuracy of the global model prior to any fine-tuning. The solid green line shows the accuracy after being fine-tuned using the data of individual patients. This fine-tuning results in significant difference in the ability of the model to predict depression relapse. After many epochs, its accuracy approaches 93%, as shown in the figure. Here, patients’ models were developed locally using CNN-LSTM structures and trained with them. The aim was to learn the spatial and temporal patterns of EEG. As seen in
Figure 1, the models were fine-tuned with the specific EEG data of each subject, enabling high precision in the detection of depression relapse.
In
Figure 5, the LRP and Shapley values converge on the same sets of EEG waves that impact the prediction of depression relapse, although they do so in different ways. LRP shows that beta waves contribute the most, which implies that exaggerated beta activity accompanied by cognitive processes and anxiety is a strong indicator of depression relapse in this model. Beta waves are therefore of particular interest in identifying and predicting relapse in LRP. However, theta waves are the greatest contributor, as pointed out by the Shapley values, as they contribute 35% to prediction. Meanwhile, beta waves (involved in emotional regulation and memory) are more prominent under the Shapley values. This finding illustrates that beta waves may underscore cognitive loads, but that theta waves may provide additional information about the emotional and memory-related components of recurrent depression. In sum, both beta waves and theta waves are significant contributors; the Shapley values suggest that theta waves have greater weight in the model’s forecasts.
Figure 6 presents the evolution of the sensitivity and specificity metrics over 100 testing epochs. Sensitivity is a measure of the ability of a model to identify true positives, i.e., the proportion of actual depression relapse cases that are correctly predicted by the model as the ones with depression relapse. Specificity measures the model’s ability to identify true negatives (percentage of non-relapse cases that it correctly classifies). In our proposed work, sensitivity is very important to ensure that we do not miss any possible relapse events; however, specificity is also very relevant, as it will help in reducing false positives, which can lead to unnecessary interventions. These two metrics together ensure a well-balanced and higher-precision depression relapse detection system; this is expected to benefit the clinical utility of the model by effectively capturing both relapse cases and non-relapse cases. In
Figure 6, the solid blue line represents the sensitivity, or the proportion of actual relapse cases—patients who experienced relapse—identified correctly. The green dashed line represents the specificity, or the proportion of actual non-relapse cases identified correctly. As one can see, both metrics display a more or less steady improvement as the testing epochs progress. Sensitivity starts at approximately 75%, suggesting that the model could identify 75% of the actual depression relapses, and grows significantly, reaching over 90% at the end. This means that the model can gradually become more effective in predicting patients who will truly relapse. At the same time, starting at 80%, the specificity also improves significantly. This growing parameter indicates that the model becomes increasingly capable of correctly identifying patients who will not relapse. The balanced improvement in both sensitivity and specificity indicates that the model becomes more capable of predicting depression relapse without raising too many false alerts.
Table 5 presents the effects of personalized federated learning (PFL) within EEG-based depression detection compared to AI models in the literature. This is particularly crucial for accurate and trustworthy diagnoses of patients; in particular, for those suspected of having depression, PFL-motivated AI models will provide higher accuracy and privacy guarantees compared to typical AI modeling. PFL presents personalized models that are protected from privacy issues, offering superior diagnostic specificity by specializing the models per patient. By incorporating PFL with explainable AI (XAI), we provide a much-needed level of transparency that has been missing in previous black-box models and helps to build clinician trust. In addition, PFL optimizations reduce the communication overhead to scale in large healthcare systems. The framework’s versatility across a variety of EEG data sources enhances its ability to make safer predictions on different devices and environments, addressing the long-standing problem of homogeneity. Therefore, PFL is a holistic solution for managing privacy, scalability, and explainability in real-world healthcare settings.
The Matthews correlation coefficient (MCC) is a metric for evaluating binary classification problems. It considers true positives, true negatives, false positives, and false negatives, which are measurable metrics compared to accuracy and hence a better measure of performance in cases where the classes are imbalanced. The MCC ranges from −1 or no agreement between the prediction and reality to +1 or a perfect prediction, with 0 for a random guess. This is a point of interest in our context, i.e., predicting depression relapses, as both false positives and false negatives have severe consequences. The MCC is also beneficial in dealing with imbalanced datasets (relapse vs. non-relapse cases), as it maintains a balance to correctly handle relapse detection vs. false alarm avoidance.
According to
Figure 7, the MCC value increases steadily over time, similarly to how the performance of the model improves. Since a higher MCC value indicates better predictions, predicting relapse cases and healthy individuals, a good model will also tend to have a balanced MCC value, meaning good sensitivity with no actual relapse cases and good specificity when all relapse cases can be detected. This can be particularly crucial for the medical field, as missing a relapse or predicting one falsely increases the possibility of missing an important signal and implementing interventions that are not needed. Given these aspects, as can be observed in the figure, our model improves over time, and it performs better both in predicting non-relapse cases and in predicting cases of relapse.
Figure 8 shows the performance of the depression relapse prediction model (applied to 100 patients) evaluated using 10-fold cross-validation. In every fold, the model trains on 90% of the data and tests on the other 10%, ensuring that each patient in the study is included once for training and testing. The figure below displays how both the accuracy and Matthews correlation coefficient (MCC) vary for each of the folds, thus illustrating a complete picture of the model’s behavior across different data partitions. Accuracy shows how well the model identifies true relapse and non-relapse cases, whereas the MCC offers a balanced evaluation by calculating for all elements in the confusion matrix, including false positives and false negatives. This process indicates that the model generalizes well across the dataset, implying its reliability in predicting depression relapse among a diverse population.
The Kaplan–Meier survival curve is a type of statistical tool to estimate the survival probability of patients in any situation over time, primarily used in medical studies. Essentially, it is used to monitor the fraction of patients who have not experienced a certain event at different time points; thus, it declines whenever an event occurs, such as a relapse, and remains flat when it does not. In the context of detecting depression relapses, this curve is important because it shows the time to relapse in a patient. Therefore, clinicians can estimate the risk of relapse for an individual over time, which can inform intervention plans. It helps us to understand how well the model predicts the timing of relapses, which is key to improving long-term psychological care.
Figure 9 presents the Kaplan–Meier survival curve, which indicates the probability of relapse-free survival for affected patients. The x-axis denotes time (days) and the y-axis is the survival probability (of not having a relapse). The curve starts at a time of 0 days with a probability of 1.0000, which means that no relapse events occurred. As time progressed, relapse events were recorded in the dataset, displayed as a step on the curve, indicating that the survival probability decreased. The steps and flat areas of the curve signify exactly when relapses were observed and not observed for patients. This plot is instrumental in determining the probability of patients relapsing based on EEG readings. Under this assumption, the appropriate intervention can be administered with great efficacy. From the curve, clinicians could, for example, estimate that, at around 300 days, patients are at a high risk for relapsed activity. The general benefit of this type of survival curve is that it allows for the estimation of the point at which all patients in the population will relapse.
Table 6 presents a comparison of the performance metrics for different EEG-based models used for depression detection. The Taxon (CNN + LSTM) model [
12] is a well-performing model (with accuracy = 88% and recall = 87%), but we avoid it due to the use of redundant computation and the lack of explainability and personalization. Federated learning (FL) [
11] has shown accuracy of around 85%, which is modest, with the same computational cost issues and modest personalization, but it lacks explainability. The reinforcement learning model [
17], on the other hand, has lower precision (75%) but is much more computationally efficient. Although explainable AI (XAI) models [
13] provide better explainability, the overall computational cost is high and they can also have little or no personalization. Meta-learning models include personalization, but they are still too computationally expensive to allow practical use (86%) [
16]. Federated learning with XAI [
15] increases both the accuracy (87%) and interpretability, while decreasing the computational cost. The authors have proposed a PFL-XAI model with the goals of high performance, explainability, personalization, and a moderate computational cost; however, the final performance metrics are still unclear. In this comprehensive compendium of comparisons, the properties and deficiencies of each model are highlighted; hence, it can be considered that the proposed method covers several gaps in relation to previous works. The training and validation curves (
Figure 4 and
Figure 5) are smoothed and supported with hyperparameter tuning to enhance convergence trends and address concerns of overfitting/underfitting. Band-related discrepancies are addressed by consistently reporting four frequency bands (alpha, beta, theta, delta) across all spectral analyses (
Figure 6 and
Figure 8). To ensure fair benchmarking,
Table 5 is presented, which includes only EEG-based studies with comparable methodologies. Statistical validation is also incorporated in terms of
p-values and 95% confidence intervals to validate the generalizability of the proposed model. The limitations of the study are as follows: generalizability is limited to the selected dataset for the model, validation on other modalities (MEG/fMRI) is required for robustness, and the findings in the present study should be considered preliminary and require large-scale, multi-center clinical validation before practical deployment. In our sample, the EEG pattern predictive of high-risk depression was marked by frontal alpha asymmetry (i.e., relative right > left alpha), increased frontal beta, and midline/theta changes—a pattern broadly consistent with the published literature across depressive presentations. These findings are also consistent with the idea that depression involves aberrant prefrontal–limbic top-down control and excitatory–inhibitory imbalance at the microcircuit level. Furthermore, the magnitude and directionality of covariates on the MFQ/CBCL was consistent with the pattern of EEG change—providing some assurance that the output of the classifier is tracking true symptom load, as opposed to spurious noise. The effect sizes of EEG markers of depression have been reported to differ by brain region, subtype, age, medication, and comorbidity, among other factors, but, in the sensitivity analyses, we found that the major EEG–risk associations were robust to the exclusion of medicated/comorbid individuals (albeit reduced in effect size, as expected). The HBN sample, while large, is quite heterogeneous and so we have been cautious in not overgeneralizing our findings—in particular, we caution that the model should be used for screening and monitoring purposes, as opposed to a more categorical diagnostic purpose. Prospective, medication-stratified testing and replication in late-life cohorts are among our near-term aims for clinical translation. Sleep/wake cycles and circadian rhythm alterations are also strongly related to depression relapse. Changes in sleep architecture, including abnormal slow-wave activity, aberrant REM onset, and dysregulated circadian patterning of alpha and theta rhythms, have been robustly linked to depressive symptomatology. Resting-state EEG in the HBN data was collected in both eyes-open and eyes-closed conditions. Rhythm dynamics from these resting-state EEG recordings reflect ongoing vigilance state processing and serve as a putative early marker of sleep dysregulation. The model also leveraged features from alpha, theta, and beta bands. These frequency bands were selected to indirectly capture sleep/wake-linked changes in neural activity that are also tied to circadian rhythms. Overnight polysomnography and actigraphy data were not available in the HBN dataset, but future applications of this framework will include such longitudinal sleep/wake data. Circadian rhythm biomarkers could also be leveraged in conjunction with EEG dynamics to better capture sleep irregularities that may often occur prior to a relapse.
Table 7 displays the results from the statistical validation of the model. The CNN-LSTM model provides predictive power that is considered strong. The accuracy was 89.2 with a relatively narrow 95% CI (87.6–90.8). Precision (88.1%) and recall (87.5%) had stable confidence intervals, suggesting a balanced ability to detect relapse and non-relapse cases. The F1-score of 87.8% also highlights the model’s robustness. All statistics were significant with
, supporting the generalizability of the proposed framework.