1. Introduction
Mental health disorders continue to pose a significant and expanding challenge to global health, with current estimates suggesting that around 970 million individuals are affected and that these conditions account for more than 14% of the total years lived with disability worldwide (WHO, 2022 [
1]; Kestel et al., 2022 [
2]). Although awareness of mental health needs has increased, access to timely, personalised, and ethically delivered care remains inadequate, especially in settings with limited resources and in communities with restricted digital access.
The rapid growth of digital health solutions has created new opportunities for detection, ongoing monitoring, and more tailored support. Nevertheless, the benefits of these developments are limited by the sensitive nature of mental health information, which is frequently dispersed across mobile phones, wearable devices, and clinical platforms. This fragmentation creates complications for data sharing, reliable integration, and consistent clinical applicability (Karagarandehkordi et al., 2025 [
3]).
Federated learning (FL) provides a privacy-aware alternative to traditional machine learning because it allows models to be trained across separate data sources without the need to move raw information to a central server (McMahan et al., 2017) [
4]. This approach aligns naturally with the distributed and varied nature of data found in mental health settings. When paired with edge and cloud-based computing, federated learning systems can offer real-time inference, reduce communication demands, and give users greater control over their personal information [
5]. Despite these advantages, practical adoption of federated learning within mental health remains limited. Questions surrounding scalability, inclusivity across different diagnostic groups, and the strength of privacy protections are still largely unanswered (Dubey et al., 2025) [
6].
Although earlier reviews have explored federated learning in the broader healthcare domain (Zhou et al., 2021 [
7]; Dhade & Shirke, 2024 [
8]), few have specifically addressed the distinctive clinical, technical, and ethical issues that arise in mental health contexts. More recent reviews focused on this area (Khalil et al., 2024 [
9]; Grataloup & Kurpicz-Briki, 2024 [
10]) highlight encouraging use cases but also reveal gaps in realistic deployment, the modelling of comorbid conditions, and the integration of multiple data types.
To address these gaps, this review synthesises 17 empirical studies that apply FL in mental health settings with explicit integration of edge, fog, or cloud computing. All candidate studies were evaluated using a structured five-question quality checklist, and only those scoring at least 7/10 were retained for detailed synthesis (see
Section 3.4 and Appendices
A.1 and A.2). This study is guided by four research questions:
- RQ1:
How have federated learning, cloud, and edge computing been implemented and evaluated in mental health systems?
Rationale: Examining the strategies used to design and assess these systems is essential for determining whether federated learning, combined with cloud and edge computing, can be applied effectively and reliably in real-world mental health environments.
- RQ2:
How diverse is the data used to predict mental health risks?
Rationale: Understanding the range of data sources, including demographic, clinical, and behavioural information, is important for assessing how data heterogeneity influences model generalisation and predictive accuracy.
- RQ3:
What privacy and security techniques are adopted across FL frameworks?
Rationale: Evaluating the privacy-preserving methods used in these systems helps determine whether strong confidentiality can be maintained while still achieving reliable predictive performance in sensitive mental health settings.
- RQ4:
What challenges and limitations do studies report regarding scalability, evaluation, and deployment?
Rationale: Identifying barriers such as technical limitations, regulatory considerations, and diagnostic constraints is relevant to understanding the practical readiness of these systems and provides insight into areas where further development is needed to support secure and scalable mental health prediction.
By examining these areas in detail, the aim of the review is to guide the development of federated learning systems that are scalable, respectful of privacy, and meaningful for clinical use in digital mental health. The remainder of the paper is structured as follows.
Section 2 outlines the research background and related literature.
Section 3 describes the methodology and the search process.
Section 4 summarises the included studies, followed by a synthesis of the main observations in
Section 5.
Section 6 presents the findings in relation to the research questions.
Section 7 discusses their implications and outlines directions for future research.
3. Systematic Literature Review Methodology
This systematic review was conducted to critically synthesise empirical research investigating the integration of federated learning (FL), edge/cloud computing, and privacy-preserving AI within mental health contexts. The review methodology followed the PRISMA 2020 framework [
22] to ensure a transparent and reproducible workflow, from literature identification through selection, assessment, and synthesis. The PRISMA diagram illustrating the screening process is shown in
Figure 1. All the PRISMA checklists and workflow processes (following guidelines from [
22]) are available as
Supplementary Materials to ensure transparency and reproducibility.
In addition, the review was designed and conducted in accordance with the systematic literature review procedures outlined by Kitchenham and Charters [
23], who provide structured guidance for planning, executing, and reporting SLRs in software engineering and computing research.
3.1. Search Strategy
Automated and manual search strategies were used to capture relevant empirical studies. The automated search was run on five major academic databases (ACM Digital Library, IEEE Xplore, ScienceDirect, SpringerLink, and Scopus). The Boolean string combined four core domains—mental health, FL, edge/cloud computing, and privacy/security—and was applied to titles, abstracts, and keywords:
(“mental health” OR “depression” OR “anxiety”) AND (“federated learning”) AND (“edge computing” OR “cloud computing”)
Although the string foregrounded ‘mental health’, ‘depression’, and ‘anxiety’, several of the final studies also focused on related or comorbid conditions such as stress, epilepsy, Alzheimer’s disease, autism, and chronic disease. These papers entered the pool in two ways: (i) through database records where these conditions were explicitly framed as part of mental health monitoring or neurological decline in conjunction with edge or cloud deployment, and (ii) via backward snowballing from the reference lists of FL–mental health–edge/cloud articles retained at the full-text stage.
This reduced reliance on the initial diagnostic keywords alone and helped surface work where mental health was described more broadly (e.g., ‘stress’, ‘cognitive decline’, and ‘chronic disease’) and where cloud, fog, or edge infrastructure was described in the methods rather than in the title.
Privacy and security were treated as conceptually important but were not added to the Boolean string to avoid overly restrictive filtering at the retrieval stage. Instead, these aspects were captured through full-text screening and data extraction. The automated search was complemented by a backward snowballing procedure in which the reference lists of all full-text articles were checked for additional eligible studies [
24,
25]. Nonetheless, there remains a residual risk that FL studies involving mental health-relevant conditions or edge/fog deployments but lacking explicit mental health or edge/cloud terminology in titles or abstracts were missed; this limitation is considered when interpreting the scope of the review.
3.2. Screening and Selection Process
The initial search retrieved 1021 unique records, which were imported into the Rayyan AI platform for duplicate removal and screening [
26]. In Phase 1, titles, abstracts, and keywords were screened by the lead author for relevance to the research questions, resulting in the exclusion of 992 records and leaving 29 studies for full-text review. To improve coverage, backward snowballing was then applied to the reference lists of these 29 papers, yielding two additional studies that met the preliminary criteria and bringing the full-text pool to 31 articles.
In Phase 2, all full texts were assessed against the inclusion and exclusion criteria. Screening and quality-assessment decisions were made by the lead reviewer and checked by two academic supervisors (second and third authors). A total of 14 studies were excluded for methodological or topical reasons, resulting in 17 empirical studies being retained for final synthesis.
3.3. Inclusion and Exclusion Criteria
Inclusion Criteria
Peer-reviewed journal articles, conference proceedings, or scholarly book chapters;
Empirical studies such as experiments, case studies, simulations, feasibility trials, or evaluations;
Studies explicitly addressing one or more of the predefined research questions;
Full-text articles written in English;
Articles published up to January 2025.
Exclusion Criteria
Secondary studies, meta-analyses, opinion pieces, editorials, or responses;
Books or non-peer-reviewed literature;
Publications not addressing federated learning, mental health, or edge/cloud deployment;
Studies that failed to meet the quality assessment threshold (see
Section 3.4), which was developed by the authors based on Kitchenham and Charters’ guidelines for systematic reviews in software engineering [
23].
3.4. Quality Assessment
To ensure methodological rigour, each of the 31 full-text studies was evaluated using a structured five-question evaluation checklist developed by the authors in accordance with Kitchenham and Charters’ guidelines [
23]. The checklist was designed to capture both relevance and quality in the context of FL for mental health. Each question was scored as 2 (fully addressed), 1 (partially addressed), or 0 (not addressed):
Are the aims or objectives of the research clearly stated and relevant to mental health AI?
Is federated learning implemented or proposed, and are privacy or security concerns explicitly discussed?
Is cloud or edge computing integrated into the system for data processing, model training, or deployment?
Do the privacy or security techniques used directly contribute to the research goals (e.g., secure AI for mental health or privacy-preserving monitoring)?
Is the experimental setup (e.g., dataset, sample size, and performance metrics) clearly described and appropriate for mental health AI?
Studies receiving a cumulative score below 7 out of 10 were excluded. Following this quality assessment, 17 studies met the inclusion threshold and were selected for final synthesis. All studies were initially scored by the lead reviewer using this five-question checklist. A random subset of studies was independently assessed by the third author, and all scores and inclusion decisions were then reviewed by the second and third authors. Any uncertainties or borderline cases were resolved through discussion, so no study was excluded solely on the basis of a single reviewer’s judgement. A custom checklist was adopted because existing formal risk-of-bias tools are primarily designed for clinical or epidemiological trials and are less suited to hybrid FL and systems/ML studies, which means that our approach is less standardized than those instruments and relies partly on subjective judgement. The detailed quality assessment for all 31 full-text studies, including Q1–Q5 scores, total score, and inclusion or exclusion rationale, is provided in Appendices
A.1 and A.2.
3.5. Data Extraction
A structured data extraction template was developed to ensure consistency in all 17 included studies. The lead reviewer (first author) extracted all data, with verification provided by supervisory academic staff (second and third authors). Extracted data fields included the following:
Study Characteristics
Title, author names, publication year, publication venue, abstract, and keywords.
Study Design
Empirical format (e.g., experiment, simulation, case study, and feasibility analysis).
Methodological Features
FL implementation (architectures, aggregation algorithms, and toolkits);
AI models used (e.g., CNN, LSTM, or Transformer).
Privacy or security methods (e.g., differential privacy, homomorphic encryption, or SMPC).
Deployment setting (cloud, edge, or fog) and hardware specifications;
Data characteristics (source type, data modality, and real-time/synthetic/public).
Outcomes and Evaluation
Model performance (accuracy, F1 score, and convergence metrics);
System-level evaluation (e.g., communication cost, latency, and energy efficiency);
Validation methods (e.g., cross-validation or real-time pilot testing);
Descriptive or quantitative reporting of privacy–performance trade-offs.
This structured process enabled thematic coding and comparative synthesis across studies to address the four core research questions.
3.6. Threats to Validity
This review has several limitations that should be acknowledged. First, the database search relied on specific keyword combinations in English, so FL studies involving mental health-relevant conditions or edge/fog deployments that used different terminology or appeared in non-English venues may have been missed. To mitigate this risk, backward snowballing was applied at the end of the second stage of the selection process: the reference lists of all included studies were reviewed to identify additional relevant papers, providing more comprehensive coverage of the literature, although a residual risk of missed studies remains. Second, screening, quality assessment, and data extraction were led by a single reviewer, with verification by two supervisors, which may have introduced some selection or interpretation bias despite the use of a structured checklist. Third, substantial heterogeneity in study designs, datasets, and evaluation protocols limits direct comparability and precludes meta-analysis, so the synthesis emphasizes qualitative patterns rather than pooled quantitative effects.
4. Overview of Included Studies
The 17 studies included in this review were published between 2021 and 2024, illustrating a rapidly evolving research focus on federated learning (FL) in mental health. The overview of studies is highlighted in
Table 1. As shown in
Figure 2, only two studies were published in 2021, with this number gradually increasing through 2022 and 2023, followed by a sharp rise to seven publications in the first ten months of 2024 alone. This pattern suggests growing recognition of FL as a viable privacy-preserving approach for distributed mental health modelling, particularly in the wake of increased global attention to mental well-being, decentralised health infrastructure, and data sensitivity in digital health applications (Dubey et al., 2025) [
6].
Thematically, the studies span a range of mental health conditions, though with a clear imbalance.
Figure 3 highlights this distribution. Depression detection is the most frequently studied condition, addressed in five studies [
27,
28,
29,
30,
31], typically leveraging social media, linguistic data, or smartphone behaviour. This focus corresponds with other recent observations by Jlassi et al. (2025) [
32] and Ebrahimi et al. (2024) [
33], who emphasised that depression research dominates FL applications due to the abundance of accessible, annotated datasets. General mental health monitoring, through activity recognition, stress detection, or behavioural proxy, is also prominent, appearing in another five studies [
34,
35,
36,
37,
38].
The included studies often combine passive sensor streams and real-time inference from edge devices, aligned with the methodological practices described by Rashmi et al. (2023) [
39]. Abnormal health detection (e.g., stroke or cognitive decline) appears in two studies [
40,
41], and epilepsy is addressed in two others [
42,
43]. Less frequently, FL is applied to Alzheimer’s disease [
39,
44], emotion analysis [
38], and autism spectrum disorder [
45]. Although several papers reference co-occurring physical or neurological conditions, such as brain tumours [
39], asthma and stroke [
40,
41], or neurodegeneration [
44], they do not explicitly model comorbidities in their frameworks. As Suruliraj and Orji (2022) [
27] and Park et al. (2024) [
46] argue, FL for mental health has largely remained focused on one disease, limiting its relevance to the complex multimorbidity profiles seen in clinical practice.
Table 1.
Overview of included studies.
| Study Title | Ref. | Year | Mental Health Focus | Co-Existing Condition |
|---|---|---|---|---|
| (Alahmadi et al., 2024) | [34] | 2024 | Mental stress detection | – |
| (Suruliraj & Orji, 2022) | [27] | 2022 | Depression detection | – |
| (Rashmi et al., 2023) | [39] | 2023 | Alzheimer’s disease diagnosis (early stages) | Brain tumour |
| (Shaik et al., 2022) | [35] | 2022 | General mental health (remote patient monitoring) | – |
| (C. Zhang et al., 2024) | [36] | 2024 | General mental health (activity recognition) | – |
| (Nurmi et al., 2023) | [37] | 2023 | General mental health (chronic disease monitoring) | Diabetes, obesity, and respiratory diseases |
| (Ching et al., 2024) | [40] | 2024 | Abnormal health detection (depression, stroke) | Stroke and asthma (not analysed) |
| (Liu, 2024) | [28] | 2024 | Depression detection (social media-based FL) | Workplace depression (not analysed) |
| (Suryakala et al., 2024) | [42] | 2024 | Epilepsy seizure detection | Epilepsy-related comorbidities mentioned |
| (D. Y. Zhang et al., 2021) | [41] | 2021 | Abnormal health detection (depression and stroke) | Asthma (not analysed) |
| (Tabassum et al., 2023) | [29] | 2023 | Depression detection | Workplace depression (not analysed) |
| (Lakhan et al., 2023) | [45] | 2023 | Autism spectrum disorder detection | – |
| (Xu et al., 2022) | [30] | 2022 | Depression detection | – |
| (Chhikara et al., 2021) | [38] | 2021 | Emotion analysis (workplace stress and post-pandemic mental health) | Workplace stress (not analysed separately) |
| (Mandawkar & Diwan, 2024) | [44] | 2024 | Alzheimer’s disease detection | Neurodegenerative disorders (not analysed separately) |
| (Baghersalimi et al., 2024) | [43] | 2024 | Epileptic seizure detection | – |
| (Li et al., 2023) | [31] | 2023 | Depression detection | – |
7. Discussion
The 17 reviewed studies reveal several recurring architectural patterns. Most systems still rely on cloud-centred FedAvg with limited edge deployment, while a smaller subset explores hierarchical, asynchronous, or decentralised overlays that better match heterogeneous devices. Common system limitations include small or simulated client pools, sparse reporting of system-level metrics (latency, energy, and bandwidth), and weak alignment between model complexity and edge hardware constraints. Deployment realism remains limited because many evaluations use software-only testbeds or short pilots, with few studies testing under realistic network variability, long-term use, or client churn. Together with narrow diagnostic coverage and minimal integration of formal privacy mechanisms, these issues create barriers to clinical adoption, where robustness, multimorbidity modelling, regulatory compliance, and end-to-end security are essential.
Across the 17 studies, only a small subset explicitly references formal data governance or regulatory frameworks; for example, Nurmi et al. [
37] highlight GDPR and data-sovereignty considerations in their smart-home FL platform, whereas most other systems address privacy only at the algorithmic level, without specifying accountability for model updates, logging, or breach notification. Real-world deployment remains limited: a few studies report pilots on actual smartphones or embedded devices (e.g., [
27,
36,
42]), but many evaluations are conducted in software-only or small-scale testbeds. Clinical validation and safety assessment are largely absent; none of the reviewed works conducts prospective trials in routine care, and only a minority explicitly involve clinicians or patients in system evaluation. These gaps in governance, regulation, deployment realism, and clinical validation currently limit the readiness of FL systems for widespread adoption in mental health services.
This review synthesises the results of 17 empirical studies that assessed the use of federated learning (FL), cloud, edge, and fog computing in relation to mental health applications. While the reviewed studies show increasing technical innovation, they also exhibit limited clinical realism, sparse evaluation of deployment, and weak interdisciplinary grounding. Overall, the field appears to be in a formative but fragmented stage, with diverse methodologies but no common frameworks for real-world scalability or clinical translation.
In order to ensure methodological rigour and comprehensive coverage, a systematic literature review (SLR) process was followed in this review. Studies were identified through structured searches across five major databases (ACM Digital Library, IEEE Xplore, ScienceDirect, SpringerLink, and Scopus) using predefined inclusion and exclusion criteria. Screening was conducted in multiple stages—title/abstract review, full-text assessment, and quality appraisal—guided by PRISMA principles. Data extraction focused on technical architectures, clinical domains, evaluation strategies, and interdisciplinary integration.
Federated learning remains the dominant architectural strategy underpinning decentralised mental health AI, particularly due to its privacy-preserving design and compatibility with distributed data sources (Dubey et al., 2025) [
6]. The majority of studies implemented horizontal FL schemes with centralised aggregation—most often using FedAvg (e.g., Refs. [
27,
28,
29,
34]). Although algorithmic simplicity may account for this, few papers explored strategies for FL under asynchronous, decentralised, or hierarchical conditions. Zhang et al. [
36] introduced a personalised multi-level FL framework tailored to heterogeneity in IoT environments, and Baghersalimi et al. [43] applied decentralised FL for resource-constrained seizure detection. Li et al. (2023) [
31] uniquely adopted asynchronous optimisation but lacked empirical benchmarking of the stability, energy cost, and fairness of the proposed model under partial client participation. These examples, while encouraging, represent isolated efforts within a broader landscape where convergence behaviour, non-IID data adaptation, and fairness-aware aggregation are largely underexplored.
Model selection across studies was varied, spanning CNNs (e.g., Refs. [
39,
44,
45]), LSTM hybrids [
45], and transformers [
31], but the rationales for these choices in relation to real-world deployment constraints were rarely discussed. Only a subset of studies engaged meaningfully with edge-device limitations such as memory footprint, on-device latency, or inference efficiency. For instance, while Refs. [
41,
43] deployed FL on Jetson boards and wearable sensors, respectively, system-level performance metrics (e.g., update lag, communication cost, and thermal load) were inconsistently or qualitatively reported. These gaps suggest that FL models are often evaluated more for algorithmic behaviour than for full-stack feasibility.
In terms of infrastructure, cloud computing was often assumed but rarely interrogated in detail. Studies referencing cloud integration (e.g., Refs. [
28,
34,
37]) mostly concerned themselves with storage or coordination functions, without much discussion of backend orchestration, the latency of services, or cost trade-offs under varying workloads. Fog computing, even though mentioned in several studies (e.g., Refs. [
34,
40,
45]), was conceptualised as an intermediate layer between edges and the cloud, yet empirical evaluation of fog-layer performance (e.g., routing resilience, real-time responsiveness, or system failover) was absent. Edge deployment, while described in a number of papers (e.g., Refs. [
35,
36,
41,
43]), was more frequently simulated than realised, leaving questions around interoperability, energy management, and local processing still open.
The diagnostic landscape was also relatively narrow. The majority of studies focused on identification of depression or stress (e.g., Refs. [
27,
28,
29,
30,
31]), while conditions such as autism [
45], epilepsy [
42,
43], Alzheimer’s disease [
39,
44], and anxiety received comparatively limited attention. Even in studies considering multimodal data (e.g., Refs. [
35,
38]), cross-modal fusion was sparsely employed, and few of the frameworks took into account comorbidity-aware or multi-label architectures. This is a significant gap, considering the prevalence of diagnostic overlap in clinical mental health care. Prior reviews (e.g., Grataloup and Kurpicz-Briki, 2024 [
10]; Khalil et al., 2024 [
68]) similarly emphasized the importance of more nuanced, inclusive FL designs that reflect population-level heterogeneity.
Evaluation practices were widely varied. Predictive metrics such as accuracy and F1 score were universally reported, but system-level measures (e.g., inference delay, resource consumption, and fault tolerance) were usually omitted. Baghersalimi et al. [
43] mentioned energy-aware constraints in wearable systems but did not directly compare against baseline architectures. Privacy-enhancing mechanisms were used occasionally: differential privacy was applied in [
31], while Refs. [
39,
44] used blockchain for distributed authentication. However, the computational demands of the proposed models and their effects on utility were not empirically evaluated, contributing to the more general tendency in FL studies to overemphasize conceptual innovation over deployment maturity (Geyer et al., 2017 [
69]).
Author-reported limitations repeat many of these issues. Sample sizes were frequently small or synthetic (e.g., Refs. [
29,
36,
37,
39]); deployment configurations were minimal, with several studies testing on only a handful of edge clients (e.g., Refs. [
40,
41]); and no included study conducted ablation analyses or user-centric validation under asynchronous or fault-prone settings. These patterns indicate that FL research in mental health remains mostly conceptual and preclinical, a consistent interpretation that is echoed by the recent literature (Khalil et al., 2024 [
9]).
Taken together, the reviewed studies are indicative of both the promise and the incomplete architectures of FL-enabled mental health systems. While algorithmic creativity is evident, there are still lapses in deployment realism, system benchmarking, diagnostic inclusivity, and diversity of design. In particular, very few studies have addressed the effects of edge–cloud coordination and fog-based buffering on downstream accuracy, fairness, and user latency in low-resource networks. Furthermore, no framework offers end-to-end modelling of resource-aware, privacy-preserving analytics across multiple devices, data types, and conditions.
Looking ahead, significant progress could depend on a number of strategic directions. First, integration between the design of the FL algorithm and system-level constraints such as intermittent connectivity, memory, and user behaviour would improve the viability of deployments. Second, the development of comorbidity-aware, multi-label models and multimodal fusion pipelines could enhance clinical relevance across diverse populations. Third, benchmarking frameworks should broaden to encompass latency, robustness, energy use, and fairness under asynchronous, non-IID, and low-participation regimes. Finally, cloud–fog–edge orchestration layers merit closer examination—not only in architectural diagrams but also in deployment trials that measure trade-offs across throughput, resilience, and patient-centric privacy.
These challenges, while significant, are addressable. They represent an evolving research frontier where federated mental health systems, if better aligned across technical and clinical domains, hold substantial potential to transform digital mental health and equitable care delivery.
Across the 17 studies, FL is predominantly implemented as cloud-centred horizontal FedAvg, with a central server coordinating updates from smartphones or IoT clients [
27,
29,
30,
31,
35], while only a minority explore alternative architectures such as hierarchical or decentralised schemes [
36,
39,
41,
43,
44]. Overall, standard cloud-based FedAvg is more frequently used than hierarchical, decentralised, or edge-native schemes, which appear in only a small subset of the reviewed work.
The observed cloud–edge deployment choices have direct implications for latency, energy efficiency, scalability, and privacy. As FL moves from conceptual frameworks to real-world implementations, these design decisions will determine whether systems are performant and ethically robust in practice. A single study demonstrates a fully decentralised, cloud-free deployment across hospital devices [
43], which aligns with recent proposals for zero-trust, peer-based FL in healthcare [
70] but also raises open questions about global coordination, interpretability, and clinical accountability.
Overall, the cloud–edge integration strategies reported in
Section 6 reveal a broad design space, from private clouds and public platforms to embedded hardware, yet there is little empirical evaluation of end-to-end latency, energy use, and fault tolerance.
The evaluation patterns observed in RQ1 are consistent with wider concerns that FL lacks agreed-upon protocols for assessing utility, efficiency, and resilience in distributed environments. While some papers provide detailed predictive metrics, many neglect latency, scalability, communication cost, and robust non-IID testing, leaving the current evidence base short of the multi-layered validation needed to judge deployment readiness in decentralised mental health settings. This gap contrasts with broader federated learning research that stresses the need to measure and stress test systems under realistic heterogeneity [
28,
50].