1. Introduction
Digital healthcare systems generate large amounts of data, but turning these data into useful decisions remains difficult. Electronic health records (EHRs), laboratory systems, imaging platforms, and patient-facing devices all produce valuable information. These data could support earlier risk detection, more personalised care, and better clinical pathway planning. However, building deployable clinical intelligence systems from such data is still challenging [1, 2, 3, 4, 5]. Patient-level data are distributed across different hospitals and systems, governed by privacy and ethical requirements, and often stored in different formats. This makes cross-site model development slow and costly. As a result, many studies still rely on data from a single institution, which can limit external validity. The issue is even more serious for rare-disease prediction, where cohorts are small and geographically distributed, and collaboration across multiple organisations is often needed to obtain sufficient statistical power [6].
Federated learning (FL) has become an important approach for multi-site modelling under privacy constraints [7, 8, 9]. In FL, each site keeps its data locally and shares only model updates rather than raw records. This suits settings in which sensitive data must remain at the source. FL has already been explored in healthcare applications such as risk prediction, medical imaging, and population health analytics [10, 11, 12, 13, 14, 15]. A key advantage of FL is that it allows models to learn from multiple sites without centralising patient records, which may improve generalisability while reducing the need for large-scale data transfer.
Despite these advantages, many healthcare FL studies are still difficult to translate into deployable services. First, clinical adoption requires auditability and accountability. Stakeholders often need to know what model was trained, which sites participated, what configuration was used, how performance changed over time, and whether the process was stable and compliant. However, many FL implementations focus mainly on the training loop and provide only limited support for audit-ready logging, end-to-end traceability, and governance-oriented reporting [9, 11]. Second, FL in healthcare must handle heterogeneity [16]. Data across sites are often Non-IID (not independent and identically distributed) because of differences in patient populations, clinical practice, coding styles, and measurement processes. This can slow convergence and lead to unstable or inconsistent results if it is not properly monitored [8, 9]. Third, practical deployment also requires clear communication. Clinicians and operational teams often need short and understandable summaries of performance, limitations, and readiness, rather than only raw metrics or low-level system logs. Finally, real deployment requires workflow integration, including round orchestration, monitoring, API-based service boundaries, and predictable runtime behaviour that can be demonstrated during evaluation and stakeholder review.
Existing FL frameworks provide strong technical foundations [17, 18, 19, 20], but many deployment-related components, such as dashboards, evidence trails, stakeholder reporting, and controlled baseline comparisons, still need to be implemented separately. Because of this, there is still a gap between algorithm-focused FL prototypes and healthcare-ready systems that treat operational transparency, audit readiness, and stakeholder communication as first-class requirements.
To address this gap, we present MediVault, a privacy-first platform designed to operationalise federated learning for healthcare pilots, with particular attention to deployability, auditability, and communication. MediVault is motivated by multi-organisation scenarios such as rare-disease risk prediction, where institutions need to collaborate while maintaining local control of sensitive records. Rather than treating FL as a standalone training routine, MediVault provides an integrated workflow from orchestration to evaluation that combines federated coordination, site-level local training, audit-ready telemetry, and a dashboard for evaluation and governance. It allows teams to (i) run round-based training across distributed sites, (ii) compare FL against a centralised baseline using the same model and hyperparameters, and (iii) produce governance- and clinician-oriented summaries based on logged evidence, which supports evidence-grounded reporting to stakeholders. We evaluate MediVault on two public healthcare classification datasets under both IID and Non-IID settings to reflect realistic site heterogeneity. The results show that federated training is competitive with centralised training, while MediVault also provides the transparency, auditability, and communication features needed for practical healthcare pilots.
Rather than proposing a new FL optimiser or cryptographic primitive, MediVault makes a system-level contribution. It designs and implements a deployment-focused healthcare FL workflow that integrates protected update exchange, secure aggregation behaviour, and auditable governance evidence within a single operational system.
The rest of this paper is organised as follows. Section 2 reviews related work on federated learning in healthcare, privacy-preserving collaboration, and deployment-oriented FL systems. Section 3 presents the MediVault architecture, workflow, and key system components. Section 4 describes the experimental setup and evaluates the platform from both system-level and model-level perspectives. Finally, Section 5 concludes the paper and outlines directions for future work.
3. Proposed System
The proposed system, called MediVault, is an auditable and security-aware federated learning-based system that supports collaborative model training across multiple healthcare data custodians without centralising patient-level data. The system is designed for healthcare analytics scenarios in which institutions need to collaborate while preserving local data control. MediVault follows a federated learning (FL) setting in which participating healthcare sites train locally on their private datasets and share only protected model updates. Rather than treating FL as an isolated training routine, MediVault provides an integrated workflow that combines round orchestration, protected update exchange, secure aggregation, and audit-ready system evidence.
3.1. System Architecture
Figure 1 shows the overall MediVault workflow. In each training round, the federated coordinator broadcasts the current global model to the participating healthcare sites. Each site then performs local training on its private dataset and computes a local model update. Before transmission, the local update is protected through an HE-based update protection step. The protected updates are then combined through an SMPC-inspired secure aggregation process, so that the coordinator receives only an aggregated result rather than plaintext individual updates. This aggregated result is used to form the updated global model, which is broadcast again for the next round.
MediVault combines a federated coordinator, site-level local training at peer nodes, and a protected update pipeline for secure submission and aggregation. Together, these elements support a protected round-based workflow in which local model updates are generated at each site, protected before transmission, securely aggregated, and then used to update the global model for the next round. The round-based training procedure is described next, followed by the two protection mechanisms used for secure update handling and aggregation.
3.2. Threat Model and Security Scope
MediVault is designed under an honest-but-curious coordinator setting. The coordinator is assumed to follow the protocol for model broadcast, update collection, and aggregation, but may attempt to infer information from the updates it receives. Participating healthcare sites keep raw patient data locally and transmit only protected model updates. Under this scope, HE protects updates during transmission and encrypted aggregation, while the SMPC-inspired masking mechanism reduces coordinator visibility of individual site updates.
Prior work has shown that gradients and model updates can leak sensitive information under certain attack settings [21, 22]. MediVault therefore focuses on reducing update exposure rather than relying only on the statement that data remain local. The current implementation does not provide a formal treatment of malicious model poisoning, Byzantine clients, collusion among compromised parties, or dropout-tolerant multi-party secure aggregation. In addition, the evidence logs are intended for operational auditability and are not yet implemented as cryptographically tamper-evident logs. Stronger guarantees, including collusion-resistant secure aggregation, signed append-only logs, hash chaining, trusted timestamping, and formal cryptographic proofs, are left for future work.
HE-based aggregation also does not by itself prevent inference from aggregate outputs or the final model. Recent robust FL studies on label-flipping attacks and graph-based clustering aggregation highlight the importance of combining update confidentiality with robust aggregation and attack detection [30, 31]. Integrating these defences into MediVault is left for future work. The security mechanisms in MediVault are therefore presented as prototype-level update-protection components under the stated honest-but-curious setting, rather than as formally proven cryptographic guarantees.
3.3. Federated Learning Workflow
Assume that training proceeds in synchronous rounds $t = 1, \dots, T$. Let $w^{t}$ denote the global model parameters at round $t$. Each peer $i \in \{1, \dots, N\}$ holds a private local dataset $D_i$. The workflow below describes how local updates are generated and then passed to the protected update pipeline for secure submission and aggregation.
1. Broadcast: The coordinator broadcasts the current global model $w^{t}$ and round identifier $t$ to all participating peers.
2. Local training: Each peer performs local optimisation for $E$ epochs (or steps) and obtains updated parameters $w_i^{t}$. The local model update is then computed as
$$\Delta_i^{t} = w_i^{t} - w^{t}.$$
3. Protected submission: Each peer protects $\Delta_i^{t}$ using the protected update pipeline described in Section 3.4 and Section 3.5, and submits only the protected update to the coordinator.
4. Aggregation and model update: The coordinator aggregates the protected updates and applies the resulting global update:
$$w^{t+1} = w^{t} + \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \Delta_i^{t},$$
where $\eta$ is the server learning rate.
In MediVault, the summation is not carried out over plaintext individual updates. Instead, aggregation is performed through the protected update pipeline described below.
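Steps 1 to 4 above can be illustrated with a short plain-Python sketch. It is not the MediVault implementation: the helper names `local_update`, `fedavg_round`, and `make_trainer` are hypothetical, the toy "training" step simply pulls the weights toward a per-peer target, and equal-weight averaging over peers is an assumption. The updates are handled in plaintext here only to keep the example short.

```python
from typing import Callable, List

Vector = List[float]

def local_update(global_w: Vector, train_fn: Callable[[Vector], Vector]) -> Vector:
    """Step 2: train locally, then return the delta (local weights minus global weights)."""
    local_w = train_fn(list(global_w))
    return [lw - gw for lw, gw in zip(local_w, global_w)]

def fedavg_round(global_w: Vector, deltas: List[Vector], server_lr: float) -> Vector:
    """Step 4: apply the averaged update, scaled by the server learning rate."""
    n = len(deltas)
    avg = [sum(d[j] for d in deltas) / n for j in range(len(global_w))]
    return [gw + server_lr * a for gw, a in zip(global_w, avg)]

def make_trainer(target: Vector, step: float = 0.5) -> Callable[[Vector], Vector]:
    """Toy stand-in for local optimisation: move the weights toward a peer-specific target."""
    return lambda w: [wi + step * (ti - wi) for wi, ti in zip(w, target)]

w = [0.0, 0.0]                                              # step 1: broadcast global model
peers = [make_trainer([1.0, 2.0]), make_trainer([3.0, 0.0])]
deltas = [local_update(w, train) for train in peers]        # steps 2-3 (unprotected here)
w_next = fedavg_round(w, deltas, server_lr=1.0)             # step 4
print(w_next)
```

With a server learning rate of 1.0 the new global model is simply the old model plus the mean of the peer deltas; smaller values damp each round's step.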
3.4. HE-Based Update Protection for Encrypted Aggregation
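To protect the confidentiality of peer updates during transmission and aggregation, MediVault uses an additive homomorphic cryptosystem, specifically Paillier. Let $\mathrm{Enc}_{pk}(\cdot)$ and $\mathrm{Dec}_{sk}(\cdot)$ denote encryption and decryption under the corresponding public and private keys.
Each peer encrypts its protected update vector element-wise:
$$c_i^{t} = \mathrm{Enc}_{pk}\left(\tilde{\Delta}_i^{t}\right),$$
where $\tilde{\Delta}_i^{t}$ denotes the peer update after optional masking. Due to the additive homomorphism of Paillier, the coordinator can combine the ciphertexts of the $N$ participating peers without decrypting individual updates:
$$c^{t} = \bigoplus_{i=1}^{N} c_i^{t},$$
where ⊕ denotes ciphertext-domain addition. The coordinator decrypts only the aggregated ciphertext:
$$\mathrm{Dec}_{sk}\left(c^{t}\right) = \sum_{i=1}^{N} \tilde{\Delta}_i^{t}.$$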
This design prevents the coordinator from directly observing plaintext individual updates during aggregation under the honest-but-curious coordinator assumption. In the current prototype, this encrypted aggregation remains practical because the evaluated models are lightweight and keep the protected update dimensionality manageable.
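The additive property used here can be demonstrated with a toy Paillier construction in pure Python. This is a sketch under stated assumptions, not the prototype's code: it uses the common simplification $g = n + 1$, two small Mersenne primes instead of a 1024-bit key, a fixed-point scale of $10^6$, a non-cryptographic random generator, and hypothetical function names throughout.

```python
import math
import random

SCALE = 10**6  # assumed fixed-point scale for encoding floats as integers

def keygen(p, q):
    """Build a toy Paillier key pair from two primes, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                               # inverse of lambda mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt integer m as (1 + m*n) * r^n mod n^2, with random r coprime to n."""
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    """Recover m; residues above n/2 are mapped back to negative integers."""
    lam, mu = sk
    m = (((pow(c, lam, n * n) - 1) // n) * mu) % n
    return m - n if m > n // 2 else m

def encrypt_vector(n, vec, rng):
    """Element-wise encryption of a fixed-point encoded update vector."""
    return [encrypt(n, round(x * SCALE), rng) for x in vec]

def add_encrypted(n, a, b):
    """Ciphertext-domain addition: multiply ciphertexts modulo n^2."""
    n2 = n * n
    return [(x * y) % n2 for x, y in zip(a, b)]

def decrypt_vector(n, sk, cvec):
    return [decrypt(n, sk, c) / SCALE for c in cvec]

# Toy key only: these primes are far too small for any real deployment.
n, sk = keygen(2**31 - 1, 2**61 - 1)
rng = random.Random(0)  # demo only; a real system needs a cryptographically secure RNG
c1 = encrypt_vector(n, [0.25, -0.5], rng)   # peer 1 update
c2 = encrypt_vector(n, [0.5, 0.125], rng)   # peer 2 update
aggregate = decrypt_vector(n, sk, add_encrypted(n, c1, c2))
print(aggregate)
```

The coordinator-side work is only `add_encrypted` followed by one `decrypt_vector` call, so individual updates are never decrypted. Each vector element becomes one ciphertext modulo $n^2$, which is why payload size and encryption time grow with update dimensionality.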
3.5. SMPC-Inspired Secure Aggregation via Additive Masking
MediVault further reduces exposure of individual updates by combining HE with an SMPC-inspired additive masking mechanism. The goal is that the coordinator receives only encrypted, masked updates and recovers only an aggregated result.
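Let $\Delta_i^{t}$ denote peer $i$'s local model update at round $t$, and let $m^{t}$ denote a pseudo-random mask vector derived from a shared seed and the round identifier $t$. Peer $i$ forms a masked update as
$$\tilde{\Delta}_i^{t} = \Delta_i^{t} + s_i\, m^{t},$$
where $s_i \in \{+1, -1\}$ controls mask cancellation. The peer then encrypts and transmits only the ciphertext $\mathrm{Enc}_{pk}(\tilde{\Delta}_i^{t})$. Using HE additivity, the coordinator combines the ciphertexts of the $N$ peers and decrypts only the aggregated masked sum:
$$\mathrm{Dec}_{sk}\left(\bigoplus_{i=1}^{N} \mathrm{Enc}_{pk}\left(\tilde{\Delta}_i^{t}\right)\right) = \sum_{i=1}^{N} \Delta_i^{t} + \left(\sum_{i=1}^{N} s_i\right) m^{t}.$$
In the current prototype, masking is implemented in a two-party setting ($N = 2$) by assigning opposite signs to the two peers so that masks cancel after aggregation:
$$s_1 = +1,\quad s_2 = -1 \quad\Rightarrow\quad \tilde{\Delta}_1^{t} + \tilde{\Delta}_2^{t} = \Delta_1^{t} + \Delta_2^{t}.$$
Thus, the coordinator recovers only the aggregated update and not any individual plaintext update under the stated threat model. The aggregated update is then used in a FedAvg-style global model update. Extending this masking mechanism to larger multi-party settings with dropout tolerance, collusion resistance, latency analysis, and formal security analysis is left as future work.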
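The two-party cancellation can be checked with a few lines of Python. The sketch below works on fixed-point integers so that cancellation is exact. The function names, the seed-derivation string, and the mask range are assumptions made for illustration, and encryption is omitted to isolate the masking step.

```python
import random

def mask_vector(shared_seed, round_id, dim):
    """Derive the same pseudo-random integer mask on both peers from seed and round."""
    rng = random.Random(f"{shared_seed}:{round_id}")
    return [rng.randrange(1, 2**32) for _ in range(dim)]  # non-zero by construction

def masked_update(delta_fp, mask, sign):
    """Add (+1) or subtract (-1) the shared mask element-wise."""
    return [d + sign * m for d, m in zip(delta_fp, mask)]

# Fixed-point encoded updates (illustrative values only).
delta_1 = [250000, -500000]
delta_2 = [500000, 125000]

mask = mask_vector(shared_seed=1234, round_id=3, dim=2)
u1 = masked_update(delta_1, mask, +1)   # peer 1 uses sign +1
u2 = masked_update(delta_2, mask, -1)   # peer 2 uses sign -1

# The coordinator only ever sums the masked vectors; the masks cancel.
aggregate = [a + b for a, b in zip(u1, u2)]
print(aggregate)
```

Because both peers derive the mask from the same seed and round identifier, neither has to send it, and a new round identifier yields a fresh mask. The sketch also makes the limitation visible: either peer, knowing the mask, could unmask the other's vector if it saw it, which is one reason collusion resistance is listed as future work.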
In addition, MediVault records round-level metadata, including round identifiers, participating peers, protected message metadata, aggregation status, and model-level summaries. These records are surfaced through the dashboard to support auditability and governance review without exposing patient-level data.
4. Evaluation
This section evaluates MediVault from two aspects: (i) system-level evidence, showing that the current implementation supports end-to-end execution with a working dashboard, protected update exchange, and an auditable protocol timeline; (ii) model-level utility, comparing federated training against a centralised baseline under both IID and Non-IID data partitions. We primarily report results for two lightweight linear classifiers, logistic regression (LOGREG) and linear SVM (LINSVM), and additionally include a lightweight MLP experiment to address non-linear model behaviour under representative settings.
4.1. Implementation and Dashboard Views
A key contribution of MediVault is that the protected collaboration workflow is not only specified conceptually but also demonstrated through an operational dashboard. Figure 2, Figure 3 and Figure 4 provide end-to-end evidence of: (i) global task configuration and round-level learning status at the coordinator; (ii) peer-side execution where each site trains locally and submits encrypted, masked model updates rather than raw patient records. These views support a deployment-oriented narrative: the current implementation operationalises secure multi-party collaboration while preserving data locality.
In addition to the primary workflow, MediVault provides a dedicated collaboration evidence layer, as shown in Figure 5 and Figure 6. This layer is designed to improve auditability and partner confidence by exposing protocol-level artefacts, such as message metadata, encryption timings, and aggregation steps, while remaining non-sensitive. Such evidence is particularly relevant for healthcare collaborations where governance requirements demand operationally verifiable traces without disclosure of patient-level information. In addition, Figure 7 shows an optional reporting interface that generates narrative summaries from non-sensitive aggregated evidence rather than raw patient records. This interface is intended to support stakeholder communication by translating logged metrics and protocol-level evidence into a more accessible form, and can be implemented using either a cloud-based generative AI service or a local model. The reporting interface follows a data-minimisation design: inputs are limited to aggregated metrics, protocol events, message metadata, and optional site-level aggregates, while raw patient records and per-sample data are excluded. For sensitive deployments, a local model can be used to avoid exporting even aggregated evidence to a third-party service, and the generated summaries are treated as communication support rather than clinical decision outputs.
4.2. Experimental Setup
We evaluate MediVault on two public binary classification datasets: Breast Cancer Wisconsin (Diagnostic) (breast_cancer) [32] and Heart Disease (heart_disease) [33]. Each dataset is split into training and test partitions using a fixed random seed (seed = 7) and an 80/20 stratified split to preserve class proportions. All reported metrics are computed on the held-out test set. The current evaluation is intended as an initial system-level validation using public tabular healthcare benchmarks, rather than a comprehensive benchmark across all FL frameworks, clinical datasets, and deployment settings.
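A stratified 80/20 split with a fixed seed can be sketched in standard-library Python as follows. The function name `stratified_split` and the per-class rounding rule are assumptions; the source states only the seed, the ratio, and that class proportions are preserved, so the indices produced here will not necessarily match those used in the reported experiments.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=7):
    """Return (train_idx, test_idx) with class proportions preserved in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, test_idx = [], []
    for y in sorted(by_class):                 # fixed class order for reproducibility
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Illustrative labels: 60 negatives and 40 positives.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting each class separately is what keeps the 60/40 class ratio identical in the training and test partitions, and reusing the same seed reproduces the same partition on every run.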
4.2.1. Models, Baselines, and FL Setting
We primarily compare two lightweight linear models that are common in clinical risk prediction, and additionally evaluate a small MLP to test whether the workflow supports a non-linear classifier:
LOGREG: logistic regression (probabilistic linear classifier).
LINSVM: linear SVM (margin-based linear classifier).
MLP: a lightweight feed-forward neural network with one hidden layer of 32 units and ReLU activation, evaluated under representative 5-peer settings.
Centralised baseline (Non-FL): the model trained on the union of all training data.
Federated learning (FL): Peers train locally and submit model updates to a coordinator. The coordinator applies a FedAvg-style aggregation over received updates and evaluates the global model each round.
4.2.2. Peer Partitions (IID vs. Non-IID)
To study heterogeneity, training data are partitioned across peers under:
IID: each peer receives a roughly representative sample of the overall data distribution.
Non-IID: peer data distributions are intentionally skewed so that different peers no longer follow the same underlying distribution, reflecting realistic site heterogeneity.
The Non-IID setting is implemented as a label-skew partition, where local peer datasets are assigned different class proportions to approximate site-level case-mix differences across hospitals. For the representative 5-peer MLP experiment, the target class-skew schedule ranges from approximately 0.70 to 0.30 across peers. This provides a simple quantitative heterogeneity control, although it does not fully capture richer clinical heterogeneity such as feature shift, coding variation, missingness, or temporal drift. We evaluate 2 peers and 5 peers to examine how scaling the number of sites influences convergence and performance. Larger client populations, client dropout, and end-to-end latency are not fully evaluated in the current prototype and are discussed as future work.
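A label-skew partition of this kind can be sketched as below. The endpoints 0.70 and 0.30 come from the text; the linear interpolation between them, the equal peer sizes, and the names `skew_schedule` and `label_skew_partition` are assumptions, since the exact sampling procedure is not specified.

```python
import random

def skew_schedule(num_peers, hi=0.70, lo=0.30):
    """Target positive-class fraction per peer, interpolated linearly from hi to lo."""
    if num_peers == 1:
        return [hi]
    step = (hi - lo) / (num_peers - 1)
    return [hi - i * step for i in range(num_peers)]

def label_skew_partition(labels, num_peers, seed=7):
    """Assign sample indices to peers so each peer approximates its target class mix."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    size = len(labels) // num_peers            # equal-sized peers (assumption)
    parts = []
    for frac in skew_schedule(num_peers):
        n_pos = min(round(frac * size), len(pos))   # take fewer if the pool runs out
        n_neg = min(size - n_pos, len(neg))
        part = [pos.pop() for _ in range(n_pos)] + [neg.pop() for _ in range(n_neg)]
        parts.append(part)
    return parts

# Illustrative balanced labels: 50 positives and 50 negatives across 5 peers.
labels = [1] * 50 + [0] * 50
parts = label_skew_partition(labels, num_peers=5)
pos_counts = [sum(labels[i] for i in part) for part in parts]
print(pos_counts)
```

Sampling without replacement keeps the peer datasets disjoint. When the global class balance cannot satisfy every target, later peers receive whatever remains, so the achieved skew should be logged rather than assumed.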
4.2.3. Metrics and Reporting Protocol
We report three standard metrics for medical risk prediction:
Accuracy (ACC): overall classification correctness.
Area Under the ROC Curve (AUC): threshold-independent ranking quality.
F1-score (F1): balances precision and recall, which is useful under potential class imbalance.
For each setting, we report: (i) Final round performance (fixed-budget deployment view), (ii) Best-over-rounds (attainable peak, relevant for early stopping), and (iii) Mean ± Std across rounds (stability). In Table 1, Table 2 and Table 3, $\Delta$ denotes FL minus Base under the same dataset/model/partition/peer configuration. The current evaluation does not claim statistical significance across multiple random seeds. The reported differences should therefore be interpreted as prototype-level empirical evidence under a fixed reproducible split; multi-seed statistical testing is identified as future work.
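The three reporting views and the FL-minus-Base difference reduce to a few lines of Python. The per-round values below are made up for illustration and are not results from the paper. The use of the population standard deviation is also an assumption, because the source does not say which estimator is reported.

```python
from statistics import mean, pstdev

def summarise_rounds(values, baseline):
    """Summarise one metric over rounds and compare the final round with the baseline."""
    return {
        "final": values[-1],               # fixed-budget deployment view
        "best": max(values),               # attainable peak (early stopping)
        "mean": mean(values),              # stability: centre
        "std": pstdev(values),             # stability: spread (population std, assumed)
        "delta": values[-1] - baseline,    # FL minus Base
    }

# Hypothetical per-round accuracy and a hypothetical centralised baseline.
fl_acc_by_round = [0.5, 0.75, 1.0, 0.75]
summary = summarise_rounds(fl_acc_by_round, baseline=0.5)
print(summary)
```

A gap between `best` and `final`, as in this toy series, is the pattern that motivates round-wise monitoring: the final round is not necessarily the best operating point.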
4.2.4. Implementation and Reproducibility Details
For LOGREG and LINSVM, experiments use 20 global rounds, one local epoch per peer per round, and a server learning rate of 0.01. The same held-out test set is used for both centralised and federated evaluation. For the MLP experiment, we use 80 global rounds, three local epochs per peer per round, a batch size of 16, a learning rate of , and FedAvgM server momentum of 0.5. In the protected-update prototype and overhead measurement, Paillier-style additive HE with a 1024-bit key is applied element-wise to fixed-point encoded model update vectors. The implementation records update dimensionality, protected payload size, encryption time, aggregation status, and round identifiers as dashboard metadata. All experiments use the fixed random seed described above to support reproducibility.
4.2.5. Secure Update Confidentiality and Secure Aggregation
MediVault follows an update-confidential FL design that combines homomorphic encryption (HE) with an SMPC-inspired additive masking mechanism (see Section 3.4 and Section 3.5 for the full protocol description). As shown in Figure 5 and Figure 6, we validate the operational behaviour of this design by exposing non-sensitive protocol artefacts in the dashboard: (i) each peer submits only encrypted, masked updates (no plaintext updates and no patient-level records); (ii) the coordinator performs additive combination on ciphertexts and decrypts only the aggregated sum; (iii) the secure-aggregation trace view provides message-level evidence such as vector dimensionality, payload size, mask identifiers/signs, and ciphertext hashes or samples, together with an ordered protocol timeline for auditability. These dashboard traces demonstrate that encrypted update exchange and masked aggregation are executed end-to-end in the prototype, supporting partner assurance without revealing local training data.
4.3. Auditability and Governance Evidence
Beyond predictive performance, MediVault is evaluated on auditability, that is, the ability to provide non-sensitive, machine-recorded evidence that a privacy-preserving collaboration occurred. This is particularly important for cross-organisation healthcare deployments where partners must justify data governance decisions and demonstrate compliance-oriented controls.
As shown in Figure 5 and Figure 6, the implementation exposes an evidence layer that logs: (i) secure message metadata (peer identifier, round index, payload size, encryption time, mask identifiers, and protocol events); (ii) an ordered protocol timeline of events (round start, message receipt, secure combine, global update, and evaluation). These artefacts are designed to be non-patient-level yet operationally verifiable, supporting post hoc inspection and partner assurance without disclosing local records or per-sample information. In this paper, auditability is evaluated using three practical criteria: whether round-level evidence is recorded, whether the recorded evidence is non-sensitive, and whether the dashboard supports post hoc inspection of training and aggregation events. These criteria do not constitute formal compliance certification and do not replace specialised monitoring platforms, but they provide an explicit basis for evaluating the governance evidence produced by the prototype.
More formally, we treat auditability as a set of measurable evidence properties rather than as a single accuracy-like score. For a training round $r$, let $A_r = (C_r, M_r, I_r, V_r)$, where $C_r$ denotes trace completeness, $M_r$ denotes data minimisation, $I_r$ denotes post hoc inspectability, and $V_r$ denotes verifiability or tamper-evidence of the recorded log. In the current prototype, $C_r = 1$ when the round records the participating peers, protected message metadata, aggregation status, and model-level summaries; $M_r = 1$ when these artefacts exclude patient-level and per-sample information; and $I_r = 1$ when the evidence can be reviewed through the dashboard after execution. The current implementation supports $C_r$, $M_r$, and $I_r$ through structured dashboard evidence, but it does not yet provide cryptographic tamper-resistance for $V_r$ through mechanisms such as hash chaining, digital signatures, append-only storage, or trusted timestamping. This definition distinguishes the proposed evidence layer from generic system logging by linking logs to governance-oriented FL events and by making the remaining log-integrity gap explicit.
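The four evidence properties described above (trace completeness, data minimisation, post hoc inspectability, and log verifiability) can be evaluated mechanically from a round record. The sketch below is one possible encoding, not the MediVault schema: the `RoundEvidence` fields, the `SENSITIVE_KEYS` deny-list, and the binary scoring are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical deny-list of keys that would indicate patient-level or per-sample data.
SENSITIVE_KEYS = {"patient_id", "record", "sample", "features", "label"}

@dataclass
class RoundEvidence:
    round_id: int
    peers: List[str]
    message_metadata: List[Dict[str, object]]
    aggregation_status: str
    model_summary: Dict[str, float]
    viewable_in_dashboard: bool = False
    integrity_protected: bool = False   # e.g. hash chaining or signatures

def audit_tuple(ev: RoundEvidence) -> Tuple[int, int, int, int]:
    """Return (completeness, minimisation, inspectability, verifiability) as 0/1 flags."""
    completeness = int(
        bool(ev.peers)
        and bool(ev.message_metadata)
        and bool(ev.aggregation_status)
        and bool(ev.model_summary)
    )
    minimisation = int(
        all(SENSITIVE_KEYS.isdisjoint(msg) for msg in ev.message_metadata)
    )
    inspectability = int(ev.viewable_in_dashboard)
    verifiability = int(ev.integrity_protected)
    return completeness, minimisation, inspectability, verifiability

evidence = RoundEvidence(
    round_id=3,
    peers=["site_a", "site_b"],
    message_metadata=[
        {"peer": "site_a", "payload_bytes": 9646, "mask_sign": 1},
        {"peer": "site_b", "payload_bytes": 9646, "mask_sign": -1},
    ],
    aggregation_status="ok",
    model_summary={"acc": 0.9},
    viewable_in_dashboard=True,
)
print(audit_tuple(evidence))
```

A record shaped like the current prototype scores 1 on the first three properties and 0 on the last, which makes the log-integrity gap an explicit, checkable value rather than a footnote.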
We therefore treat auditability as a first-class evaluation axis alongside accuracy metrics: the dashboard evidence demonstrates that MediVault provides a practical governance view for protected collaboration in addition to model training outcomes.
4.4. Results: Performance Comparison (Centralised vs. Federated)
Table 1, Table 2 and Table 3 summarise results for LOGREG and LINSVM. On breast_cancer, both models achieve near-ceiling performance centrally, and FL remains competitive in the observed results: LOGREG largely matches the centralised baseline (final-round ACC ≈ 0.986), while LINSVM shows small drops under FL (e.g., up to ∼2–3% absolute ACC under 2-peer Non-IID). On heart_disease, LOGREG yields the strongest centralised baseline, and FL shows slightly higher observed values under IID partitioning (e.g., ACC changes from 0.853 to 0.868). Under Non-IID partitions, final-round metrics can drop (ACC 0.838) but best-over-rounds remains competitive in this setting, suggesting that monitoring and early stopping may be practical strategies under heterogeneous deployments. LINSVM exhibits more sensitivity across configurations: it can match or show higher observed values than the baseline in some cases (e.g., 2-peer Non-IID ACC 0.836 vs. 0.803) but degrades under others (e.g., 5-peer IID ACC 0.787 vs. 0.803), indicating higher variance in heterogeneous or small-sample regimes.
To address the limitation of evaluating only linear models, we additionally include a lightweight MLP under representative 5-peer settings. As shown in Table 4, the MLP results indicate that the MediVault workflow can also support a non-linear classifier. On breast_cancer, FL closely matches the centralised baseline and shows slightly higher observed values in some metrics. On heart_disease, FL also shows higher observed final-round values than the centralised MLP baseline, although the round-wise results still show that non-linear models can benefit from monitoring and early stopping. These results suggest that the proposed workflow is not limited to linear classifiers, while more extensive evaluation with deeper models remains future work.
To quantify the computational and communication cost of protected update exchange, we conduct a prototype-level overhead measurement for representative update vectors. The overhead experiment was conducted on a local macOS machine with an Apple M4 processor and 16 GB RAM, using a Python-based (Python 3.12) Paillier implementation. Each setting was repeated five times, and mean values are reported in Table 5. The table reports update dimensionality, plaintext payload size, encrypted payload size, encryption time, ciphertext aggregation time, and decryption time under a 5-peer setting. These measurements are intended to characterise prototype-level overhead rather than optimised cryptographic performance.
The results show that the overhead increases with update dimensionality. For example, the encrypted payload per peer increases from approximately 9.42 KB for the breast_cancer linear models to 285.28 KB for the breast_cancer MLP. Similarly, encryption time per peer increases from approximately 258–259 ms for the linear models to 8679 ms for the MLP. These results confirm that HE-based protection introduces non-negligible computational and communication overhead, especially for larger update vectors, and motivate future optimisation and bandwidth-aware deployment planning.
FedAvg is used as the baseline aggregation rule because the purpose of this work is to evaluate the MediVault system workflow under a standard and widely understood FL training procedure rather than to propose a new optimiser. Under Non-IID data, FedAvg can be sensitive to client drift; therefore, the evaluation reports final-round, best-over-round, and mean ± std results to characterise stability, and the MLP experiment uses FedAvgM as a lightweight mitigation strategy. More advanced robust or personalised aggregation methods are compatible with the architecture and are left for future work.
Figure 8, Figure 9 and Figure 10 visualise final-round performance for Base vs. FL across datasets, partitions, peer counts, and models. Across the evaluated conditions, the observed FL performance is generally competitive with the centralised baseline. Differences are small on breast_cancer due to near-ceiling performance, while larger sensitivity is observed on heart_disease, particularly under Non-IID partitions and for LINSVM.
To complement the summary tables, we plot round-wise ACC/AUC/F1 trajectories under representative IID and Non-IID settings to illustrate convergence behaviour and stability. The distinction between final-round and best-over-round performance is particularly useful under heterogeneous data, where the final operating point may fluctuate even when strong intermediate rounds are reached.
Figure 11 shows that under IID partitioning, FL converges smoothly and can match or exceed the centralised baseline. Under Non-IID partitioning, convergence remains observable but with larger fluctuations and a lower final-round operating point. This pattern is consistent with client drift and slower stabilisation under heterogeneous sites, and it explains why monitoring best-over-round performance is useful when client distributions are skewed.
Figure 12 indicates that both LOGREG and LINSVM achieve near-ceiling performance for breast_cancer, and FL closely tracks the centralised baseline under both IID and Non-IID partitions. In this dataset, differences are small and are more visible through stability than through final-point performance.
4.5. Discussion
The evaluation shows that MediVault provides both system-level and model-level value for privacy-preserving healthcare collaboration. At the system level, the dashboard and evidence views demonstrate that the current implementation supports local training, protected update exchange, secure aggregation, and auditable protocol traces without exposing patient-level data. This is important for healthcare deployments, where partner trust depends not only on privacy-preserving computation but also on visible and reviewable operational evidence.
At the model level, federated learning with LOGREG, LINSVM, and the additional lightweight MLP remains competitive with the centralised baseline across the evaluated datasets. On breast_cancer, the models operate near a saturated performance regime, so differences between centralised and federated settings are small. On heart_disease, LOGREG is generally more robust across partition and peer configurations, while LINSVM shows greater sensitivity. The MLP experiment further shows that the workflow can support a simple non-linear classifier, but that round-wise monitoring remains important for selecting stable operating points.
The results also confirm the expected effect of heterogeneity. Non-IID partitions increase variance and can lower final-round performance, especially on heart_disease. At the same time, the best-over-rounds results indicate that competitive operating points are still reachable, which supports the use of monitoring and early stopping in practical deployments. The current evaluation remains limited to public tabular datasets, 2–5 peers for the main experiments, and representative 5-peer MLP settings; larger-scale clients, dropout, and latency measurements remain to be explored in future work.
Finally, the optional reporting interface shown in Figure 7 illustrates how non-sensitive aggregated evidence can be presented in a more accessible form for stakeholders. Together with the protocol-level metadata exposed by the dashboard, this suggests that MediVault can support not only privacy-preserving training, but also the transparency and governance readiness needed for real multi-site healthcare collaboration. The added HE overhead measurement further shows that protected update exchange introduces measurable computational and payload costs, especially for larger update vectors such as the MLP. This reinforces the need for optimised cryptographic implementation and bandwidth-aware deployment planning.
5. Conclusions
This paper presented MediVault, an auditable and security-aware federated learning-based system that enables privacy-preserving healthcare collaboration without sharing raw patient records. MediVault combines federated learning with prototype encrypted update exchange and an SMPC-inspired secure aggregation workflow, and exposes an auditable evidence layer through a working dashboard to support governance and partner trust. Experiments on breast_cancer and heart_disease using LOGREG, LINSVM, and an additional lightweight MLP suggest that federated training can remain competitive with a centralised baseline under the evaluated IID and Non-IID settings, with expected sensitivity to data heterogeneity.
Several directions remain for future work. First, the current secure aggregation and homomorphic-encryption mechanisms are demonstrated at prototype level, and future work should consider stronger adversarial settings, collusion resistance, dropout-tolerant multi-party secure aggregation, and more formal security analysis. Second, the present experiments use benchmark tabular datasets rather than real clinical deployment data, so further validation will require more representative healthcare datasets and appropriate governance or ethical pathways. Third, broader evaluation is needed under realistic network conditions, larger numbers of peers, client dropout, high-dimensional healthcare datasets, deeper neural architectures, multi-seed statistical significance testing, comparisons with established FL frameworks such as Flower, FedML, and FATE, and more extensive analyses of robustness, fairness, distribution shift, model inversion risk, poisoning resilience, network-level latency, and optimised cryptographic overhead.