MediVault: An Auditable and Secure Federated Learning System for Privacy-Preserving Healthcare Collaboration

Li, Jie; Adeel, Usman; Akram, Muhammad Safwan

doi:10.3390/a19060427

by Jie Li ¹,*, Usman Adeel ¹ and Muhammad Safwan Akram ²

¹ School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough TS1 3BA, UK

² National Horizons Centre, Teesside University, Middlesbrough DL1 1HG, UK

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(6), 427; https://doi.org/10.3390/a19060427

Submission received: 16 April 2026 / Revised: 19 May 2026 / Accepted: 20 May 2026 / Published: 25 May 2026

(This article belongs to the Special Issue Artificial Intelligence in Modern Cybersecurity: Changes, Applications and Challenges)

Abstract

Healthcare analytics is often limited by data silos and strict privacy requirements, which make it difficult to share patient-level records across organisations and to build robust predictive models. Federated learning (FL) provides an alternative by keeping data local and exchanging model updates instead of raw records. However, many existing FL solutions remain difficult to deploy in healthcare settings, as they provide limited support for auditability, governance-oriented evidence, and system-level transparency. This paper presents MediVault, an auditable and security-aware federated learning-based system for privacy-preserving healthcare collaboration. MediVault combines round-based federated training, prototype-level protected update exchange, audit-ready telemetry, and an interactive dashboard that exposes non-sensitive evidence of collaboration, model progress, and protocol execution. In addition, the system supports controlled reporting to improve stakeholder communication during pilot deployments. We evaluate MediVault on two public healthcare classification datasets, Breast Cancer Wisconsin (Diagnostic) and Heart Disease, under IID and label-skewed Non-IID settings. Experiments are conducted using logistic regression, linear SVM, and an additional lightweight MLP under matched settings. The observed results suggest that federated training remains competitive with centralised training under the evaluated settings. A prototype-level overhead analysis further shows that protected update exchange introduces measurable computational and communication costs, especially for larger update vectors. These findings indicate that MediVault can support initial system-level validation of auditable, privacy-preserving healthcare FL workflows, while further work is needed for larger-scale deployment, stronger adversarial evaluation, and real-world clinical validation.

Keywords:

federated learning; privacy-preserving machine learning; healthcare analytics; secure aggregation; auditability; homomorphic encryption

1. Introduction

Digital healthcare systems generate large amounts of data, but turning these data into useful decisions is still difficult. Electronic health records (EHRs), laboratory systems, imaging platforms, and patient-facing devices all produce valuable information. These data could support earlier risk detection, more personalised care, and better clinical pathway planning. However, building deployable clinical intelligence systems from such data is still challenging [1,2,3,4,5]. Patient-level data are distributed across different hospitals and systems, governed by privacy and ethical requirements, and often stored in different formats. This makes cross-site model development slow and costly. As a result, many studies still rely on data from a single institution, which can limit external validity. The issue is even more serious for rare-disease prediction, where cohorts are small and geographically distributed, and collaboration across multiple organisations is often needed to obtain sufficient statistical power [6].

Federated learning (FL) has become an important approach for multi-site modelling under privacy constraints [7,8,9]. In FL, each site keeps its data locally and shares only model updates rather than raw records. This fits well when sensitive data needs to remain at the source. FL has already been explored in healthcare applications such as risk prediction, medical imaging, and population health analytics [10,11,12,13,14,15]. A key advantage of FL is that it allows models to learn from multiple sites without centralising patient records, which may improve generalisability while reducing the need for large-scale data transfer.

Despite these advantages, many healthcare FL studies are still difficult to translate into deployable services. First, clinical adoption requires auditability and accountability. Stakeholders often need to know what model was trained, which sites participated, what configuration was used, how performance changed over time, and whether the process was stable and compliant. However, many FL implementations focus mainly on the training loop and provide only limited support for audit-ready logging, end-to-end traceability, and governance-oriented reporting [9,11]. Second, healthcare FL must cope with heterogeneity [16]. Data across sites are often Non-IID (not independent and identically distributed) because of differences in patient populations, clinical practice, coding styles, and measurement processes. This can slow convergence and lead to unstable or inconsistent results if it is not properly monitored [8,9]. Third, practical deployment also requires clear communication. Clinicians and operational teams often need short and understandable summaries of performance, limitations, and readiness, rather than only raw metrics or low-level system logs. Finally, real deployment requires workflow integration, including round orchestration, monitoring, API-based service boundaries, and predictable runtime behaviour that can be demonstrated during evaluation and stakeholder review.

Existing FL frameworks provide strong technical foundations [17,18,19,20], but many deployment-related components, such as dashboards, evidence trails, stakeholder reporting, and controlled baseline comparisons, still need to be implemented separately. Because of this, there is still a gap between algorithm-focused FL prototypes and healthcare-ready systems that treat operational transparency, audit readiness, and stakeholder communication as first-class requirements.

To address this gap, we present MediVault, a privacy-first platform designed to operationalise federated learning for healthcare pilots, with particular attention to deployability, auditability, and communication. MediVault is motivated by multi-organisation scenarios such as rare-disease risk prediction, where institutions need to collaborate while maintaining local control of sensitive records. Rather than treating FL as a standalone training routine, MediVault provides an integrated workflow from orchestration to evaluation. It allows teams to (i) run round-based training across distributed sites, (ii) compare FL against a centralised baseline using the same model and hyperparameters, and (iii) produce governance- and clinician-oriented summaries based on logged evidence. The workflow combines federated coordination, site-level local training, audit-ready telemetry, and a dashboard for evaluation and governance. In addition, the platform supports evidence-grounded reporting to improve stakeholder communication. We evaluate MediVault on two public healthcare classification datasets under both IID and Non-IID settings to reflect realistic site heterogeneity. The results suggest that federated training remains competitive with centralised training under the evaluated settings, while MediVault also provides the transparency, auditability, and communication features needed for practical healthcare pilots.

Rather than proposing a new FL optimiser or cryptographic primitive, MediVault makes a system-level contribution. It designs and implements a deployment-focused healthcare FL workflow that integrates protected update exchange, secure aggregation behaviour, and auditable governance evidence within a single operational system.

The rest of this paper is organised as follows. Section 2 reviews related work on federated learning in healthcare, privacy-preserving collaboration, and deployment-oriented FL systems. Section 3 presents the MediVault architecture, workflow, and key system components. Section 4 describes the experimental setup and evaluates the platform from both system-level and model-level perspectives. Finally, Section 5 concludes the paper and outlines directions for future work.

2. Background

2.1. Federated Learning in Healthcare

Federated learning (FL) is a distributed learning paradigm in which multiple sites train local models on private data and periodically share model updates with a coordinator, which aggregates them into a global model [7,9]. In healthcare, FL is particularly attractive because medical data are naturally distributed across institutions and are subject to strict privacy, legal, and governance constraints. By keeping data local and exchanging model updates instead of raw records, FL provides a practical alternative to centralised learning for collaborative healthcare analytics.

Although FL is promising for healthcare, several practical and methodological challenges remain. First, statistical heterogeneity is common. Hospitals may differ in patient demographics, clinical workflows, measurement devices, and coding practices, which can lead to client drift and unstable training [8,9]. Second, system heterogeneity can arise from differences in compute capacity, network conditions, and local operational constraints. Third, privacy risks do not disappear simply because raw data are not shared. Prior studies have shown that gradients or model updates may still leak sensitive information under certain attack settings, such as gradient inversion or feature leakage attacks [21,22]. Therefore, healthcare FL systems must consider not only predictive performance, but also update confidentiality, secure aggregation, and operational trust.

2.2. Update Confidentiality and Secure Aggregation

Protecting privacy in FL requires distinguishing two related goals: update confidentiality and secure aggregation. Update confidentiality aims to prevent eavesdroppers or untrusted intermediaries from reading individual client updates in transit or at rest. Secure aggregation aims to ensure that the coordinator learns only an aggregate statistic, such as a sum or an average, rather than any single participant’s update.

Secure aggregation is commonly implemented using ideas from secure multi-party computation (SMPC). A representative approach is additive masking, where each client masks its update in such a way that the masks cancel out during aggregation, allowing the server to recover only the sum of updates [17]. Such protocols are attractive because they can be efficient and provide a clear privacy guarantee under an explicit threat model, often an honest-but-curious server. However, practical deployment still requires solutions for dropout handling, key management, and robustness to stragglers.

Homomorphic encryption (HE) provides another way to protect model updates by allowing computation directly over encrypted values [23]. In FL, HE is often used so that the coordinator performs additive aggregation on ciphertexts rather than plaintext updates. This can provide stronger confidentiality for transmitted updates, but it also introduces additional computation and communication overhead. As a result, many systems restrict operations to simple additions, smaller models, or lightweight HE settings in order to balance privacy and efficiency.

2.3. Auditability, Governance Evidence, and Trust in Cross-Organisation Collaboration

In cross-organisation healthcare collaborations, technical privacy mechanisms alone are often not enough for deployment readiness. Partners may also require operational evidence that collaboration occurred as claimed and that protocol constraints were followed. For example, they may want evidence that local training was performed on-site, no raw data were exported, and only restricted message contents were exchanged. This introduces a requirement beyond predictive performance: auditability and governance readiness. Prior work in healthcare FL has highlighted the importance of trust, transparency, and practical deployment considerations, but many FL systems mainly expose training metrics and engineering logs rather than a structured, non-sensitive evidence layer designed for post hoc review and partner assurance [11,12].

2.4. Related Work

Federated learning has been widely applied to support collaborative healthcare analytics [24,25,26,27]. Existing works have explored FL in a range of healthcare settings, including electronic health records, medical imaging, remote monitoring, and other smart-health applications, showing that multi-site model development is feasible even when direct data sharing is restricted [25,26]. Recent reviews also consistently note that FL in the healthcare domain still faces challenges such as Non-IID data, communication overhead, privacy risk, and deployment complexity [24,26,27].

A number of existing works have therefore focused on strengthening the security and privacy of FL beyond simple local-data retention. In particular, secure aggregation has become a central topic, since the coordinator should ideally learn only an aggregated statistic rather than any individual client update. Studies in [17,24,26] have explored several directions for secure aggregation, including masking-based schemes, secret sharing, and cryptographic approaches such as homomorphic encryption, while also emphasising practical requirements such as communication efficiency, robustness to user dropout, and compatibility with high-dimensional model updates. These approaches strengthen confidentiality beyond the assumption that data merely remain local, although they may introduce additional computational and engineering overhead in real deployments [23,25,27]. Recent efficient secure aggregation protocols, such as SecAgg+ and LightSecAgg, further demonstrate the importance of scalable secure aggregation, reduced communication cost, and dropout handling in large-scale FL settings [28,29]. These works provide useful reference points for future extensions of MediVault beyond the current two-party masking prototype.

Robust aggregation has also been studied as a defence against poisoning and manipulation in federated settings. For example, recent work on robust Android malware detection has investigated label-flipping attacks and defences, while GRAF-IDS uses graph-based clustering as an aggregation strategy for federated intrusion detection in IoT networks [30,31]. These studies show that update confidentiality alone is not sufficient to address all adversarial risks; robust aggregation and attack-aware validation remain important extensions for systems such as MediVault.

At the same time, prior research has shown that “data stay local” does not fully remove privacy risks in federated learning. Attacks on gradients and parameter updates demonstrate that sensitive information may still be inferred from shared model updates, which motivates stronger update-protection mechanisms in privacy-sensitive domains such as healthcare [21,22]. This concern is also reflected in recent healthcare FL reviews, which argue that privacy protection often requires a combination of local training, protected update exchange, and system-level safeguards rather than reliance on local data retention alone [24,25,26].

From a systems perspective, existing FL research provides strong foundations for federated coordination, local training, privacy-aware aggregation, and healthcare-oriented application design [24,25,26,27]. However, most existing works primarily focus on learning architectures, privacy mechanisms, or application taxonomies. Much less attention is given to governance-oriented transparency at the system level, such as protocol timelines, secure message evidence, and non-sensitive audit views that can support partner assurance in real healthcare collaborations. MediVault builds on these prior directions but does not claim novelty in the underlying FL optimiser or cryptographic primitives. Its contribution is system-oriented: it integrates FL, HE-based update protection, SMPC-inspired aggregation, and an audit-oriented evidence layer into a healthcare-focused workflow. Compared with generic FL frameworks, MediVault explicitly treats governance-oriented transparency, protocol evidence, and stakeholder-facing reporting as first-class system requirements rather than auxiliary implementation details.

3. Proposed System

The proposed system, called MediVault, is an auditable and security-aware federated learning-based system that supports collaborative model training across multiple healthcare data custodians without centralising patient-level data. The system is designed for healthcare analytics scenarios in which institutions need to collaborate while preserving local data control. MediVault follows a federated learning (FL) setting in which participating healthcare sites train locally on their private datasets and share only protected model updates. Rather than treating FL as an isolated training routine, MediVault provides an integrated workflow that combines round orchestration, protected update exchange, secure aggregation, and audit-ready system evidence.

3.1. System Architecture

Figure 1 shows the overall MediVault workflow. In each training round, the federated coordinator broadcasts the current global model to the participating healthcare sites. Each site then performs local training on its private dataset and computes a local model update. Before transmission, the local update is protected through an HE-based update protection step. The protected updates are then combined through an SMPC-inspired secure aggregation process, so that the coordinator receives only an aggregated result rather than plaintext individual updates. This aggregated result is used to form the updated global model, which is broadcast again for the next round.

MediVault combines a federated coordinator, site-level local training at peer nodes, and a protected update pipeline for secure submission and aggregation. Together, these elements support a protected round-based workflow in which local model updates are generated at each site, protected before transmission, securely aggregated, and then used to update the global model for the next round. The round-based training procedure is described next, followed by the two protection mechanisms used for secure update handling and aggregation.

3.2. Threat Model and Security Scope

MediVault is designed under an honest-but-curious coordinator setting. The coordinator is assumed to follow the protocol for model broadcast, update collection, and aggregation, but may attempt to infer information from the updates it receives. Participating healthcare sites keep raw patient data locally and transmit only protected model updates. Under this scope, HE protects updates during transmission and encrypted aggregation, while the SMPC-inspired masking mechanism reduces coordinator visibility of individual site updates.

Prior work has shown that gradients and model updates can leak sensitive information under certain attack settings [21,22]. MediVault therefore focuses on reducing update exposure rather than relying only on the statement that data remain local. The current implementation does not provide a formal treatment of malicious model poisoning, Byzantine clients, collusion among compromised parties, or dropout-tolerant multi-party secure aggregation. In addition, the evidence logs are intended for operational auditability and are not yet implemented as cryptographically tamper-evident logs. Stronger guarantees, including collusion-resistant secure aggregation, signed append-only logs, hash chaining, trusted timestamping, and formal cryptographic proofs, are left for future work.

HE-based aggregation also does not by itself prevent inference from aggregate outputs or the final model. Recent robust FL studies on label-flipping attacks and graph-based clustering aggregation highlight the importance of combining update confidentiality with robust aggregation and attack detection [30,31]. Integrating these defences into MediVault is left for future work. The security mechanisms in MediVault are therefore presented as prototype-level update-protection components under the stated honest-but-curious setting, rather than as formally proven cryptographic guarantees.

3.3. Federated Learning Workflow

Assume that training proceeds in synchronous rounds $t = 1, \dots, T$. Let $w^{(t)}$ denote the global model parameters at round $t$. Each peer $i \in \{1, \dots, N\}$ holds a private local dataset $D_{i}$. The workflow below describes how local updates are generated and then passed to the protected update pipeline for secure submission and aggregation.

1. Broadcast: The coordinator broadcasts the current global model $w^{(t)}$ and round identifier $t$ to all participating peers.
2. Local training: Each peer performs local optimisation for $E$ epochs (or steps) and obtains updated parameters $w_{i}^{(t)}$. The local model update is then computed as

$\Delta w_{i}^{(t)} = w_{i}^{(t)} - w^{(t)}.$ (1)

3. Protected submission: Each peer protects $\Delta w_{i}^{(t)}$ using the protected update pipeline described in Section 3.4 and Section 3.5, and submits only the protected update to the coordinator.
4. Aggregation and model update: The coordinator aggregates the protected updates and applies the resulting global update:

$w^{(t + 1)} = w^{(t)} + \eta \cdot \frac{1}{N} \sum_{i = 1}^{N} \Delta w_{i}^{(t)},$ (2)

where $\eta$ is the server learning rate.

In MediVault, the summation is not carried out over plaintext individual updates. Instead, aggregation is performed through the protected update pipeline described below.
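To make Equations (1) and (2) concrete, the following is a minimal plaintext sketch of one round. It deliberately omits the protected update pipeline, and the function names and the example parameter values are illustrative assumptions rather than part of the MediVault implementation.

```python
from typing import List

Vector = List[float]


def local_update(w_global: Vector, w_local: Vector) -> Vector:
    """Eq. (1): delta_i = w_i - w for a single peer."""
    return [wl - wg for wl, wg in zip(w_local, w_global)]


def apply_global_update(w_global: Vector, deltas: List[Vector], eta: float) -> Vector:
    """Eq. (2): w_next = w + eta * (1/N) * sum_i delta_i."""
    n_peers = len(deltas)
    summed = [sum(column) for column in zip(*deltas)]
    return [wg + eta * s / n_peers for wg, s in zip(w_global, summed)]


# Hypothetical values: a 2-parameter model and two peers.
w = [0.0, 1.0]
peer_models = [[0.5, 1.5], [1.5, 0.5]]  # parameters after local training
deltas = [local_update(w, w_local) for w_local in peer_models]
w_next = apply_global_update(w, deltas, eta=1.0)
```

With $\eta = 1$ the update reduces to the plain average of the peer models, which is the usual FedAvg special case; smaller values of $\eta$ damp the step taken by the coordinator.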

3.4. HE-Based Update Protection for Encrypted Aggregation

To protect the confidentiality of peer updates during transmission and aggregation, MediVault uses an additive homomorphic cryptosystem, specifically Paillier. Let $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ denote encryption and decryption under the corresponding public and private keys.

Each peer encrypts its protected update vector element-wise:

$c_{i}^{(t)} = \mathrm{Enc}(\widetilde{\Delta w}_{i}^{(t)}),$ (3)

where $\widetilde{\Delta w}_{i}^{(t)}$ denotes the peer update after optional masking. Due to the additive homomorphism of Paillier, the coordinator can combine ciphertexts without decrypting individual updates:

$c_{sum}^{(t)} = \bigoplus_{i = 1}^{N} c_{i}^{(t)} = \mathrm{Enc}\left(\sum_{i = 1}^{N} \widetilde{\Delta w}_{i}^{(t)}\right),$ (4)

where ⊕ denotes ciphertext-domain addition. The coordinator decrypts only the aggregated ciphertext:

$\widetilde{\Delta w}_{sum}^{(t)} = \mathrm{Dec}(c_{sum}^{(t)}).$ (5)

This design prevents the coordinator from directly observing plaintext individual updates during aggregation under the honest-but-curious coordinator assumption. In the current prototype, this encrypted aggregation remains practical because the evaluated models are lightweight and keep the protected update dimensionality manageable.
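The additive property used in Equations (3) to (5) can be illustrated with a textbook Paillier construction. The sketch below is a toy, assuming tiny hard-coded primes, the common generator choice $g = n + 1$, and non-negative integer updates; it is not secure, it is not the MediVault code, and all function names are hypothetical.

```python
import math
import secrets
from typing import Tuple

PrivateKey = Tuple[int, int]  # (lambda, mu)


def keygen(p: int, q: int) -> Tuple[int, PrivateKey]:
    """Toy key pair from two distinct primes (no primality check is done)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) = lam mod n
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Enc(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2


def add(n: int, c1: int, c2: int) -> int:
    """Ciphertext-domain addition: multiply ciphertexts modulo n^2."""
    return (c1 * c2) % (n * n)


def decrypt(n: int, priv: PrivateKey, c: int) -> int:
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


n, priv = keygen(1009, 1013)  # toy primes, NOT secure
peer_updates = [[3, 10, 7], [4, 5, 1]]  # hypothetical integer-encoded updates
encrypted = [[encrypt(n, v) for v in update] for update in peer_updates]
c_sum = [add(n, c1, c2) for c1, c2 in zip(*encrypted)]  # Eq. (4)
delta_sum = [decrypt(n, priv, c) for c in c_sum]        # Eq. (5)
```

Only `c_sum` is decrypted, so the element-wise sum is recovered without decrypting either peer's ciphertexts individually. Real-valued and negative updates would first need an integer encoding, which this sketch leaves out.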

3.5. SMPC-Inspired Secure Aggregation via Additive Masking

MediVault further reduces exposure of individual updates by combining HE with an SMPC-inspired additive masking mechanism. The goal is that the coordinator receives only encrypted, masked updates and recovers only an aggregated result.

Let $\Delta w_{i}^{(t)}$ denote peer $i$'s local model update at round $t$, and let $m_{i}^{(t)}$ denote a pseudo-random mask vector derived from a shared seed and the round identifier $t$. Peer $i$ forms a masked update as

$\widetilde{\Delta w}_{i}^{(t)} = \Delta w_{i}^{(t)} + s_{i} m_{i}^{(t)},$ (6)

where $s_{i} \in \{+1, -1\}$ controls mask cancellation. The peer then encrypts and transmits only

$\mathrm{Enc}(\widetilde{\Delta w}_{i}^{(t)}).$ (7)

Using HE additivity, the coordinator combines ciphertexts with the ciphertext-domain addition ⊕ of Equation (4) and decrypts only the aggregated masked sum:

$\mathrm{Dec}\left(\bigoplus_{i = 1}^{N} \mathrm{Enc}(\widetilde{\Delta w}_{i}^{(t)})\right) = \sum_{i = 1}^{N} \left(\Delta w_{i}^{(t)} + s_{i} m_{i}^{(t)}\right).$ (8)

In the current prototype, masking is implemented in a two-party setting ($N = 2$) by assigning opposite signs to the two peers. Because both peers derive the same mask from the shared seed and round identifier ($m_{1}^{(t)} = m_{2}^{(t)}$), the masks cancel after aggregation:

$\widetilde{\Delta w}_{1}^{(t)} + \widetilde{\Delta w}_{2}^{(t)} = \Delta w_{1}^{(t)} + \Delta w_{2}^{(t)}.$ (9)

Thus, the coordinator recovers only the aggregated update and not any individual plaintext update under the stated threat model. The aggregated update is then used in a FedAvg-style global model update. Extending this masking mechanism to larger multi-party settings with dropout tolerance, collusion resistance, latency analysis, and formal security analysis is left as future work.
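The two-party cancellation in Equation (9) can be sketched without encryption as follows. The sketch assumes integer fixed-point updates so that cancellation is exact, and it uses Python's `random` module purely for reproducibility; a deployment would need a cryptographically secure, keyed generator. The scale, seed, and function names are hypothetical.

```python
import random
from typing import List

SCALE = 10**6  # hypothetical fixed-point scale


def encode(update: List[float]) -> List[int]:
    """Map real-valued updates to integers so that masks cancel exactly."""
    return [round(x * SCALE) for x in update]


def mask_vector(shared_seed: int, round_id: int, dim: int) -> List[int]:
    """Derive the same mask on both peers from the shared seed and round id.

    random.Random is NOT cryptographically secure; it is used for illustration only.
    """
    rng = random.Random(f"{shared_seed}:{round_id}")
    return [rng.randrange(-(10**12), 10**12) for _ in range(dim)]


def masked_update(update_fp: List[int], mask: List[int], sign: int) -> List[int]:
    """Eq. (6): masked = delta + s_i * m, with s_i in {+1, -1}."""
    return [u + sign * m for u, m in zip(update_fp, mask)]


delta_1 = encode([0.25, -0.5, 1.0])   # hypothetical peer 1 update
delta_2 = encode([0.75, 0.5, -2.0])   # hypothetical peer 2 update
mask = mask_vector(shared_seed=1234, round_id=3, dim=3)

tilde_1 = masked_update(delta_1, mask, +1)
tilde_2 = masked_update(delta_2, mask, -1)

aggregate = [a + b for a, b in zip(tilde_1, tilde_2)]  # Eq. (9): masks cancel
recovered = [v / SCALE for v in aggregate]
```

Each masked vector on its own differs from the underlying update by the mask, while their sum equals the sum of the unmasked updates.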

In addition, MediVault records round-level metadata, including round identifiers, participating peers, protected message metadata, aggregation status, and model-level summaries. These records are surfaced through the dashboard to support auditability and governance review without exposing patient-level data.

4. Evaluation

This section evaluates MediVault from two aspects: (i) system-level evidence, showing that the current implementation supports end-to-end execution with a working dashboard, protected update exchange, and an auditable protocol timeline; (ii) model-level utility, comparing federated training against a centralised baseline under both IID and Non-IID data partitions. We primarily report results for two lightweight linear classifiers, logistic regression (LOGREG) and linear SVM (LINSVM), and additionally include a lightweight MLP experiment to address non-linear model behaviour under representative settings.

4.1. Implementation and Dashboard Views

A key contribution of MediVault is that the protected collaboration workflow is not only specified conceptually but also demonstrated through an operational dashboard. Figure 2, Figure 3 and Figure 4 provide end-to-end evidence of: (i) global task configuration and round-level learning status at the coordinator; and (ii) peer-side execution where each site trains locally and submits encrypted, masked model updates rather than raw patient records. These views support a deployment-oriented narrative: the current implementation operationalises secure multi-party collaboration while preserving data locality.

In addition to the primary workflow, MediVault provides a dedicated collaboration evidence layer, as shown in Figure 5 and Figure 6. This layer is designed to improve auditability and partner confidence by exposing protocol-level artefacts, such as message metadata, encryption timings, and aggregation steps, while remaining non-sensitive. Such evidence is particularly relevant for healthcare collaborations where governance requirements demand operationally verifiable traces without disclosure of patient-level information. In addition, Figure 7 shows an optional reporting interface that generates narrative summaries from non-sensitive aggregated evidence rather than raw patient records. This interface is intended to support stakeholder communication by translating logged metrics and protocol-level evidence into a more accessible form, and can be implemented using either a cloud-based generative AI service or a local model. The reporting interface follows a data-minimisation design: inputs are limited to aggregated metrics, protocol events, message metadata, and optional site-level aggregates, while raw patient records and per-sample data are excluded. For sensitive deployments, a local model can be used to avoid exporting even aggregated evidence to a third-party service, and the generated summaries are treated as communication support rather than clinical decision outputs.
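One simple way to enforce such a data-minimisation rule is an allow-list applied before anything reaches the reporting model. The sketch below assumes the evidence is held as a dictionary; the key names and the function are hypothetical and are not taken from the MediVault code.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field names mirroring the input categories listed above.
ALLOWED_FIELDS = frozenset(
    {"aggregated_metrics", "protocol_events", "message_metadata", "site_aggregates"}
)


def build_report_input(evidence: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Keep only allow-listed fields and return the names of anything dropped."""
    payload = {key: value for key, value in evidence.items() if key in ALLOWED_FIELDS}
    dropped = sorted(key for key in evidence if key not in ALLOWED_FIELDS)
    return payload, dropped


evidence = {
    "aggregated_metrics": {"auc_final": None},  # placeholder, no real result
    "protocol_events": ["round_start", "secure_combine"],
    "raw_patient_records": ["..."],             # must never leave the site
}
payload, dropped = build_report_input(evidence)
```

An allow-list fails closed: a new field added to the evidence store is excluded from reports until it is explicitly approved.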

4.2. Experimental Setup

We evaluate MediVault on two public binary classification datasets: Breast Cancer Wisconsin (Diagnostic) (breast_cancer) [32] and Heart Disease (heart_disease) [33]. Each dataset is split into training and test partitions using a fixed random seed (seed = 7) and an 80/20 stratified split to preserve class proportions. All reported metrics are computed on the held-out test set. The current evaluation is intended as an initial system-level validation using public tabular healthcare benchmarks, rather than a comprehensive benchmark across all FL frameworks, clinical datasets, and deployment settings.
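A stratified 80/20 split with a fixed seed can be sketched in a few lines. The version below is a standard-library illustration on synthetic labels; the source does not state which library performed the split, so the function name and the per-class rounding rule are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.2, seed: int = 7
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumed rounding rule
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 60 + [1] * 40  # synthetic labels, not the paper's datasets
train_idx, test_idx = stratified_split(labels)
```

If scikit-learn is available, `train_test_split(X, y, test_size=0.2, stratify=y, random_state=7)` is one common way to obtain a comparable split, although the exact indices would differ from this sketch.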

4.2.1. Models, Baselines, and FL Setting

We primarily compare two lightweight linear models that are common in clinical risk prediction, and additionally evaluate a small MLP to test whether the workflow supports a non-linear classifier:

LOGREG: logistic regression (probabilistic linear classifier).
LINSVM: linear SVM (margin-based linear classifier).
MLP: a lightweight feed-forward neural network with one hidden layer of 32 units and ReLU activation, evaluated under representative 5-peer settings.

For each model, we compare:

Centralised baseline (Non-FL): the model trained on the union of all training data.
Federated learning (FL): Peers train locally and submit model updates to a coordinator. The coordinator applies a FedAvg-style aggregation over received updates and evaluates the global model each round.

4.2.2. Peer Partitions (IID vs. Non-IID)

To study heterogeneity, training data are partitioned across peers under:

IID: each peer receives a roughly representative sample of the overall data distribution.
Non-IID: peer data distributions are intentionally skewed so that different peers no longer follow the same underlying distribution, reflecting realistic site heterogeneity.

The Non-IID setting is implemented as a label-skew partition, where local peer datasets are assigned different class proportions to approximate site-level case-mix differences across hospitals. For the representative 5-peer MLP experiment, the target class-skew schedule ranges from approximately 0.70 to 0.30 across peers. This provides a simple quantitative heterogeneity control, although it does not fully capture richer clinical heterogeneity such as feature shift, coding variation, missingness, or temporal drift. We evaluate 2 peers and 5 peers to examine how scaling the number of sites influences convergence and performance. Larger client populations, client dropout, and end-to-end latency are not fully evaluated in the current prototype and are discussed as future work.
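A label-skew partition of this kind can be sketched as follows. The sketch assumes that the schedule refers to the positive-class fraction per peer and that it is interpolated linearly from 0.70 to 0.30; both points, along with the function name and the synthetic labels, are assumptions rather than details confirmed by the source.

```python
import random
from typing import List, Sequence


def label_skew_partition(
    labels: Sequence[int], n_peers: int = 5, hi: float = 0.70, lo: float = 0.30, seed: int = 7
) -> List[List[int]]:
    """Assign sample indices to peers with a decreasing positive-class fraction."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    size = len(labels) // n_peers  # equal-sized peers; any remainder is unused
    step = (hi - lo) / (n_peers - 1) if n_peers > 1 else 0.0
    peers: List[List[int]] = []
    for k in range(n_peers):
        target = hi - k * step
        n_pos = min(round(size * target), len(pos))  # cap by what is left
        n_neg = min(size - n_pos, len(neg))
        peers.append(pos[:n_pos] + neg[:n_neg])
        pos, neg = pos[n_pos:], neg[n_neg:]
    return peers


labels = [1] * 100 + [0] * 100  # synthetic balanced labels
peers = label_skew_partition(labels)
positives_per_peer = [sum(labels[i] for i in peer) for peer in peers]
```

On imbalanced data the later peers may fall short of their target fraction once one class pool is exhausted, which is why the achieved proportions should be logged alongside the targets.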

4.2.3. Metrics and Reporting Protocol

We report three standard metrics for medical risk prediction:

Accuracy (ACC): overall classification correctness.
Area Under the ROC Curve (AUC): threshold-independent ranking quality.
F1-score (F1): balances precision and recall, which is useful under potential class imbalance.
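For reference, the three metrics can be computed from their definitions as in the sketch below. It is a plain-Python illustration for binary labels in {0, 1}, with AUC computed as the fraction of correctly ranked positive-negative pairs; the example labels and scores are made up and are not results from the paper.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2TP / (2TP + FP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos: List[float] = [s for t, s in zip(y_true, scores) if t == 1]
    neg: List[float] = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0, 1]             # made-up labels
scores = [0.9, 0.4, 0.6, 0.7, 0.8]   # made-up model scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for small tabular test sets but would be replaced by a rank-based computation at scale.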

For each setting, we report: (i) Final round performance (fixed-budget deployment view), (ii) Best-over-rounds (attainable peak, relevant for early stopping), and (iii) Mean ± Std across rounds (stability). In Table 1, Table 2 and Table 3, $\Delta$ denotes FL minus Base under the same dataset/model/partition/peer configuration. The current evaluation does not claim statistical significance across multiple random seeds. The reported differences should therefore be interpreted as prototype-level empirical evidence under a fixed reproducible split; multi-seed statistical testing is identified as future work.
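The reporting protocol amounts to a small summary over the per-round metric series. The sketch below assumes the population standard deviation, since the source does not say whether the population or sample form is used, and the per-round values are placeholders rather than reported results.

```python
import statistics
from typing import Dict, Sequence


def summarise_rounds(values: Sequence[float]) -> Dict[str, float]:
    """Final-round, best-over-rounds, and mean/std summaries of one metric."""
    return {
        "final": values[-1],
        "best": max(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),  # assumption: population std
    }


def delta(fl_value: float, base_value: float) -> float:
    """Delta = FL minus Base under the same configuration."""
    return fl_value - base_value


fl_auc_by_round = [0.90, 0.94, 0.96, 0.95]  # placeholder values only
base_auc = 0.96                             # placeholder centralised baseline
summary = summarise_rounds(fl_auc_by_round)
gap = delta(summary["final"], base_auc)
```

A negative gap means the federated model trails the centralised baseline on that metric, and a positive gap means it exceeds it.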

4.2.4. Implementation and Reproducibility Details

For LOGREG and LINSVM, experiments use 20 global rounds, one local epoch per peer per round, and a server learning rate of 0.01. The same held-out test set is used for both centralised and federated evaluation. For the MLP experiment, we use 80 global rounds, three local epochs per peer per round, a batch size of 16, a learning rate of $10^{-3}$, and FedAvgM server momentum of 0.5. In the protected-update prototype and overhead measurement, Paillier-style additive HE with a 1024-bit key is applied element-wise to fixed-point encoded model update vectors. The implementation records update dimensionality, protected payload size, encryption time, aggregation status, and round identifiers as dashboard metadata. All experiments use the fixed random seed described above to support reproducibility.
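
To make the element-wise scheme concrete, the sketch below shows fixed-point encoding followed by textbook Paillier encryption, ciphertext addition, and decryption of the sum. It is a toy written for illustration, not the MediVault implementation: the key is built from two small Mersenne primes rather than a 1024-bit modulus, the scale factor `SCALE` and all function names are assumptions, and key generation, encryption, and decryption run in one process here although the peers and the coordinator would hold these roles separately.

```python
import math
import secrets

SCALE = 10**6  # hypothetical fixed-point scale: six decimal places


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy key from two known Mersenne primes; far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encode(x, n):
    """Fixed-point encode a float; negative values wrap around modulo n."""
    return round(x * SCALE) % n


def decode(m, n):
    """Map a residue back to a signed fixed-point value."""
    if m > n // 2:
        m -= n
    return m / SCALE


def encrypt(n, m):
    """Paillier encryption with g = n + 1: c = (1 + m*n) * r^n mod n^2."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (1 + m * n) * pow(r, n, n2) % n2


def add_ciphertexts(n, ciphertexts):
    """Multiplying ciphertexts modulo n^2 adds the underlying plaintexts."""
    acc = 1
    for c in ciphertexts:
        acc = acc * c % (n * n)
    return acc


def decrypt(n, priv, c):
    """Recover m = L(c^lam mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    lam, mu = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n


def aggregate_encrypted(updates, n, priv):
    """Encrypt each update element-wise, add ciphertexts, decrypt only the sum."""
    enc = [[encrypt(n, encode(x, n)) for x in u] for u in updates]
    summed = [add_ciphertexts(n, column) for column in zip(*enc)]
    return [decode(decrypt(n, priv, c), n) for c in summed]
```

Because every vector element becomes one ciphertext, both payload size and encryption time grow with update dimensionality, which is the effect measured in Table 5.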

4.2.5. Secure Update Confidentiality and Secure Aggregation

MediVault follows an update-confidential FL design that combines homomorphic encryption (HE) with an SMPC-inspired additive masking mechanism (see the full protocol description in the Proposed System section). As shown in Figure 5 and Figure 6, we validate the operational behaviour of this design by exposing non-sensitive protocol artefacts in the dashboard: (i) each peer submits only encrypted, masked updates (no plaintext updates and no patient-level records); (ii) the coordinator performs additive combination on ciphertexts and decrypts only the aggregated sum; (iii) the secure-aggregation trace view provides message-level evidence such as vector dimensionality, payload size, mask identifiers/signs, and ciphertext hashes or samples, together with an ordered protocol timeline for auditability. These dashboard traces demonstrate that encrypted update exchange and masked aggregation are executed end-to-end in the prototype, supporting partner assurance without revealing local training data.
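
The cancellation property behind additive masking can be illustrated with pairwise masks: for each pair of peers, one adds a shared random vector and the other subtracts it, so the masks vanish in the sum. The sketch below shows only this property on integer-encoded updates. It is an assumption-laden illustration, not the prototype's protocol: the modulus, the function names, and the seeded `random.Random` generator (which is not cryptographically secure) are all chosen for the example, and the HE layer is omitted.

```python
import random

MOD = 2**32  # hypothetical modulus for integer-encoded updates


def pairwise_masks(n_peers, dim, seed=0):
    """Build one mask per peer such that all masks sum to zero modulo MOD."""
    rng = random.Random(seed)  # illustration only; not a cryptographic generator
    masks = [[0] * dim for _ in range(n_peers)]
    for i in range(n_peers):
        for j in range(i + 1, n_peers):
            shared = [rng.randrange(MOD) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] = (masks[i][k] + shared[k]) % MOD  # sign +1 for peer i
                masks[j][k] = (masks[j][k] - shared[k]) % MOD  # sign -1 for peer j
    return masks


def mask_update(update, mask):
    """What a peer would submit: its update plus its mask."""
    return [(u + m) % MOD for u, m in zip(update, mask)]


def aggregate(masked_updates):
    """Coordinator view: only the element-wise sum is meaningful."""
    dim = len(masked_updates[0])
    return [sum(v[k] for v in masked_updates) % MOD for k in range(dim)]
```

A single masked vector looks unrelated to the plaintext update, while the aggregate equals the sum of the unmasked updates. This simple form assumes every peer submits; handling dropout requires additional machinery.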

4.3. Auditability and Governance Evidence

Beyond predictive performance, MediVault is evaluated on auditability, that is, the ability to provide non-sensitive, machine-recorded evidence that a privacy-preserving collaboration occurred. This is particularly important for cross-organisation healthcare deployments where partners must justify data governance decisions and demonstrate compliance-oriented controls.

As shown in Figure 5 and Figure 6, the implementation exposes an evidence layer that logs: (i) secure message metadata (peer identifier, round index, payload size, encryption time, mask identifiers, and protocol events); (ii) an ordered protocol timeline of events (round start, message receipt, secure combine, global update, and evaluation). These artefacts are designed to be non-patient-level yet operationally verifiable, supporting post hoc inspection and partner assurance without disclosing local records or per-sample information. In this paper, auditability is evaluated using three practical criteria: whether round-level evidence is recorded, whether the recorded evidence is non-sensitive, and whether the dashboard supports post hoc inspection of training and aggregation events. These criteria do not constitute formal compliance certification and do not replace specialised monitoring platforms, but they provide an explicit basis for evaluating the governance evidence produced by the prototype.
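
One way to picture such an evidence layer is as a list of typed records restricted to non-patient-level fields, from which an ordered timeline can be rebuilt per round. The sketch below is hypothetical: the class `EvidenceRecord`, its field names, and the event identifiers are assumptions modelled on the metadata and events listed above, not the schema used by the MediVault dashboard.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

# Event names mirror the protocol events named in the text (identifiers assumed).
PROTOCOL_EVENTS = ("round_start", "message_receipt", "secure_combine",
                   "global_update", "evaluation")


@dataclass(frozen=True)
class EvidenceRecord:
    """One non-patient-level log entry; field names are illustrative."""
    seq: int                              # position in the protocol timeline
    round_index: int
    event: str
    peer_id: Optional[str] = None
    payload_bytes: Optional[int] = None
    encryption_ms: Optional[float] = None
    mask_ids: Tuple[str, ...] = ()

    def __post_init__(self):
        if self.event not in PROTOCOL_EVENTS:
            raise ValueError(f"unknown protocol event: {self.event}")


def timeline(records, round_index):
    """Ordered event names for one round, for post hoc inspection."""
    selected = [r for r in records if r.round_index == round_index]
    return [r.event for r in sorted(selected, key=lambda r: r.seq)]


def to_json(record):
    """Serialise a record for storage or display."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the record type has no field for features, labels, or per-sample outputs, such data cannot be logged by accident through this path.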

More formally, we treat auditability as a set of measurable evidence properties rather than as a single accuracy-like score. For a training round $r$, let $A_{r} = (T_{r}, D_{r}, I_{r}, V_{r})$, where $T_{r}$ denotes trace completeness, $D_{r}$ denotes data minimisation, $I_{r}$ denotes post hoc inspectability, and $V_{r}$ denotes verifiability or tamper-evidence of the recorded log. In the current prototype, $T_{r} = 1$ when the round records the participating peers, protected message metadata, aggregation status, and model-level summaries; $D_{r} = 1$ when these artefacts exclude patient-level and per-sample information; and $I_{r} = 1$ when the evidence can be reviewed through the dashboard after execution. The current implementation supports $T_{r}$, $D_{r}$, and $I_{r}$ through structured dashboard evidence, but it does not yet provide cryptographic tamper-resistance for $V_{r}$ through mechanisms such as hash chaining, digital signatures, append-only storage, or trusted timestamping. This definition distinguishes the proposed evidence layer from generic system logging by linking logs to governance-oriented FL events and by making the remaining log-integrity gap explicit.
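
The four properties can be evaluated mechanically for a round, as in the sketch below. The field names, the `audit_tuple` helper, and in particular the SHA-256 hash chain are assumptions: the chain shows one of the mechanisms named above that could back the tamper-evidence property in a future version, and it is not part of the current prototype, for which the last component is 0.

```python
import hashlib
import json

GENESIS = "0" * 64
REQUIRED = ("peers", "message_metadata", "aggregation_status", "model_summary")
SENSITIVE = ("patient_records", "per_sample_outputs")  # hypothetical field names


def _link(prev_hash, entry):
    """Hash of the previous link concatenated with the canonical entry."""
    body = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append_entry(log, entry):
    """Append an entry whose hash commits to everything logged before it."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "hash": _link(prev, entry)})


def chain_is_valid(log):
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for item in log:
        if item["hash"] != _link(prev, item["entry"]):
            return False
        prev = item["hash"]
    return True


def audit_tuple(round_record, dashboard_visible, log=None):
    """Return (T_r, D_r, I_r, V_r) as 0/1 flags for one training round."""
    t = int(all(round_record.get(k) for k in REQUIRED))     # trace completeness
    d = int(not any(k in round_record for k in SENSITIVE))  # data minimisation
    i = int(bool(dashboard_visible))                        # post hoc inspectability
    v = int(bool(log) and chain_is_valid(log))              # tamper-evidence
    return (t, d, i, v)
```

A hash chain only detects modification relative to a trusted latest hash; signatures or external timestamping would still be needed to anchor that value.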

We therefore treat auditability as a first-class evaluation axis alongside accuracy metrics: the dashboard evidence demonstrates that MediVault provides a practical governance view for protected collaboration in addition to model training outcomes.

4.4. Results: Performance Comparison (Centralised vs. Federated)

Table 1, Table 2 and Table 3 summarise results for LOGREG and LINSVM. On breast_cancer, both models achieve near-ceiling performance centrally, and FL remains competitive in the observed results: LOGREG largely matches the centralised baseline (final-round ACC of 0.979–0.986 vs. 0.986), while LINSVM shows small drops under FL (at most 0.026 absolute final-round ACC, under 2-peer Non-IID). On heart_disease, LOGREG yields the strongest centralised baseline, and FL shows slightly higher observed values under IID partitioning (e.g., final-round ACC of 0.868 vs. 0.853). Under Non-IID partitions, final-round ACC drops to 0.838, but best-over-rounds ACC matches the baseline (0.882), suggesting that monitoring and early stopping may be practical strategies under heterogeneous deployments. LINSVM exhibits more sensitivity across configurations: it can match or show higher observed values than the baseline in some cases (e.g., 2-peer Non-IID ACC 0.836 vs. 0.803) but degrades under others (e.g., 5-peer IID ACC 0.787 vs. 0.803), indicating higher variance in heterogeneous or small-sample regimes.

To address the limitation of evaluating only linear models, we additionally include a lightweight MLP under representative 5-peer settings. As shown in Table 4, the MLP results indicate that the MediVault workflow can also support a non-linear classifier. On breast_cancer, FL closely matches the centralised baseline and shows slightly higher observed values in some metrics. On heart_disease, FL also shows higher observed final-round values than the centralised MLP baseline, although the round-wise results still show that non-linear models can benefit from monitoring and early stopping. These results suggest that the proposed workflow is not limited to linear classifiers, while more extensive evaluation with deeper models remains future work.

To quantify the computational and communication cost of protected update exchange, we conduct a prototype-level overhead measurement for representative update vectors. The overhead experiment was conducted on a local macOS machine with an Apple M4 processor and 16 GB RAM, using a Python-based (Python 3.12) Paillier implementation. Each setting was repeated five times, and mean values are reported in Table 5. The table reports update dimensionality, plaintext payload size, encrypted payload size, encryption time, ciphertext aggregation time, and decryption time under a 5-peer setting. These measurements are intended to characterise prototype-level overhead rather than optimised cryptographic performance.

The results show that the overhead increases with update dimensionality. For example, the encrypted payload per peer increases from approximately 9.42 KB for the breast_cancer linear models to 285.28 KB for the breast_cancer MLP. Similarly, encryption time per peer increases from approximately 258–259 ms for the linear models to 8679 ms for the MLP. These results confirm that HE-based protection introduces non-negligible computational and communication overhead, especially for larger update vectors, and motivate future optimisation and bandwidth-aware deployment planning.
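
The scaling can be made explicit by normalising the Table 5 measurements per update element. The snippet below only re-arranges values already reported in Table 5; the dictionary layout and the helper name `derived` are introduced here for illustration.

```python
# Values copied from Table 5 (5-peer setting); keys are (dataset, model).
TABLE5 = {
    ("BC", "LINSVM"): {"dim": 31, "plain_kb": 0.24, "enc_kb": 9.42, "enc_ms": 259.3},
    ("BC", "LOGREG"): {"dim": 31, "plain_kb": 0.24, "enc_kb": 9.42, "enc_ms": 258.2},
    ("BC", "MLP"): {"dim": 1025, "plain_kb": 8.01, "enc_kb": 285.28, "enc_ms": 8679.2},
    ("HD", "LINSVM"): {"dim": 15, "plain_kb": 0.12, "enc_kb": 4.98, "enc_ms": 124.7},
    ("HD", "LOGREG"): {"dim": 15, "plain_kb": 0.12, "enc_kb": 4.98, "enc_ms": 125.3},
    ("HD", "MLP"): {"dim": 513, "plain_kb": 4.01, "enc_kb": 143.14, "enc_ms": 4367.7},
}


def derived(row):
    """Per-element cost and ciphertext expansion for one Table 5 row."""
    return {
        "expansion": row["enc_kb"] / row["plain_kb"],         # encrypted / plaintext size
        "enc_ms_per_element": row["enc_ms"] / row["dim"],     # encryption time per element
        "enc_kb_per_element": row["enc_kb"] / row["dim"],     # payload per element
    }


if __name__ == "__main__":
    for key, row in TABLE5.items():
        d = derived(row)
        print(key, f"{d['expansion']:.1f}x", f"{d['enc_ms_per_element']:.2f} ms/elem")
```

Across the six rows this gives roughly 8.3 to 8.5 ms of encryption time per element and a ciphertext expansion of roughly 36 to 42 times, which is consistent with cost growing approximately linearly in update dimensionality. The plaintext sizes in Table 5 are rounded to two decimals, so the ratios for the smallest vectors are approximate.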

FedAvg is used as the baseline aggregation rule because the purpose of this work is to evaluate the MediVault system workflow under a standard and widely understood FL training procedure rather than to propose a new optimiser. Under Non-IID data, FedAvg can be sensitive to client drift; therefore, the evaluation reports final-round, best-over-round, and mean ± std results to characterise stability, and the MLP experiment uses FedAvgM as a lightweight mitigation strategy. More advanced robust or personalised aggregation methods are compatible with the architecture and are left for future work.
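
For readers less familiar with the two aggregation rules, the sketch below shows sample-weighted FedAvg and a server-momentum step in the style of FedAvgM. It follows the commonly used formulation in which the server treats the gap between the current global model and the client average as a pseudo-gradient; the function names, the `server_lr` parameter, and its default are assumptions, and the prototype's exact update rule may differ.

```python
import numpy as np


def fedavg(client_weights, n_samples):
    """Average client weight vectors, weighted by local sample counts."""
    w = np.asarray(n_samples, dtype=float)
    w /= w.sum()
    return np.tensordot(w, np.stack(client_weights), axes=1)


def fedavgm_step(global_w, client_weights, n_samples, velocity,
                 momentum=0.5, server_lr=1.0):
    """One server update with momentum; momentum=0 and server_lr=1 give FedAvg."""
    avg = fedavg(client_weights, n_samples)
    pseudo_grad = global_w - avg                  # direction away from the client average
    velocity = momentum * velocity + pseudo_grad  # accumulate across rounds
    return global_w - server_lr * velocity, velocity
```

The momentum term smooths round-to-round changes in the aggregated direction, which is why it is often used as a lightweight response to client drift.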

Figure 8, Figure 9 and Figure 10 visualise final-round performance for Base vs. FL across datasets, partitions, peer counts, and models. Across the evaluated conditions, the observed FL performance is generally competitive with the centralised baseline. Differences are small on breast_cancer due to near-ceiling performance, while larger sensitivity is observed on heart_disease, particularly under Non-IID partitions and for LINSVM.

To complement the summary tables, we plot round-wise ACC/AUC/F1 trajectories under representative IID and Non-IID settings to illustrate convergence behaviour and stability. The distinction between final-round and best-over-round performance is particularly useful under heterogeneous data, where the final operating point may fluctuate even when strong intermediate rounds are reached.

Figure 11 shows that under IID partitioning, FL converges smoothly and can match or exceed the centralised baseline. Under Non-IID partitioning, convergence remains observable but with larger fluctuations and a lower final-round operating point. This pattern is consistent with client drift and slower stabilisation under heterogeneous sites, and it explains why monitoring best-over-round performance is useful when client distributions are skewed.

Figure 12 indicates that both LOGREG and LINSVM achieve near-ceiling performance for breast_cancer, and FL closely tracks the centralised baseline under both IID and Non-IID partitions. In this dataset, differences are small and are more visible through stability than through final-point performance.

4.5. Discussion

The evaluation shows that MediVault provides both system-level and model-level value for privacy-preserving healthcare collaboration. At the system level, the dashboard and evidence views demonstrate that the current implementation supports local training, protected update exchange, secure aggregation, and auditable protocol traces without exposing patient-level data. This is important for healthcare deployments, where partner trust depends not only on privacy-preserving computation but also on visible and reviewable operational evidence.

At the model level, federated learning with LOGREG, LINSVM, and the additional lightweight MLP remains competitive with the centralised baseline across the evaluated datasets. On breast_cancer, the models operate near a saturated performance regime, so differences between centralised and federated settings are small. On heart_disease, LOGREG is generally more robust across partition and peer configurations, while LINSVM shows greater sensitivity. The MLP experiment further shows that the workflow can support a simple non-linear classifier, but that round-wise monitoring remains important for selecting stable operating points.

The results also confirm the expected effect of heterogeneity. Non-IID partitions increase variance and can lower final-round performance, especially on heart_disease. At the same time, the best-over-rounds results indicate that competitive operating points are still reachable, which supports the use of monitoring and early stopping in practical deployments. The current evaluation remains limited to public tabular datasets, 2–5 peers for the main experiments, and representative 5-peer MLP settings; larger-scale clients, dropout, and latency measurements remain to be explored in future work.

Finally, the optional reporting interface shown in Figure 7 illustrates how non-sensitive aggregated evidence can be presented in a more accessible form for stakeholders. Together with the protocol-level metadata exposed by the dashboard, this suggests that MediVault can support not only privacy-preserving training, but also the transparency and governance readiness needed for real multi-site healthcare collaboration. The added HE overhead measurement further shows that protected update exchange introduces measurable computational and payload costs, especially for larger update vectors such as the MLP. This reinforces the need for optimised cryptographic implementation and bandwidth-aware deployment planning.

5. Conclusions

This paper presented MediVault, an auditable and security-aware federated learning-based system that enables privacy-preserving healthcare collaboration without sharing raw patient records. MediVault combines federated learning with prototype encrypted update exchange and an SMPC-inspired secure aggregation workflow, and exposes an auditable evidence layer through a working dashboard to support governance and partner trust. Experiments on breast_cancer and heart_disease using LOGREG, LINSVM, and an additional lightweight MLP suggest that federated training can remain competitive with a centralised baseline under the evaluated IID and Non-IID settings, with expected sensitivity to data heterogeneity.

Several directions remain for future work. First, the current secure aggregation and homomorphic-encryption mechanisms are demonstrated at prototype level, and future work should consider stronger adversarial settings, collusion resistance, dropout-tolerant multi-party secure aggregation, and more formal security analysis. Second, the present experiments use benchmark tabular datasets rather than real clinical deployment data, so further validation will require more representative healthcare datasets and appropriate governance or ethical pathways. Third, broader evaluation is needed under realistic network conditions, larger numbers of peers, client dropout, high-dimensional healthcare datasets, deeper neural architectures, multi-seed statistical significance testing, comparisons with established FL frameworks such as Flower, FedML, and FATE, and more extensive analyses of robustness, fairness, distribution shift, model inversion risk, poisoning resilience, network-level latency, and optimised cryptographic overhead.

Author Contributions

Conceptualization, J.L.; methodology, J.L. and U.A.; software, J.L.; validation, J.L. and U.A.; formal analysis, J.L.; investigation, U.A. and M.S.A.; resources, U.A. and M.S.A.; data curation, J.L.; writing—original draft preparation, J.L., U.A. and M.S.A.; writing—review and editing, J.L., U.A. and M.S.A.; visualisation, J.L.; supervision, J.L., U.A. and M.S.A.; project administration, J.L., U.A. and M.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

The project is funded by the Cyber Security Academic Startup Accelerator Programme (Year 9), Project Code: 10173383.

Data Availability Statement

The datasets used in this study are publicly available. The Breast Cancer Wisconsin (Diagnostic) dataset is available from the UCI Machine Learning Repository and is also accessible through the scikit-learn datasets module. The Heart Disease dataset is available from the UCI Machine Learning Repository. The source code for the MediVault prototype is publicly available at GitHub: https://github.com/lij008/MediVault (accessed on 19 May 2026). No new patient-level clinical dataset was created in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overview of the MediVault workflow.

Figure 2. Coordinator dashboard overview showing the current learning status, update dimensionality, and aggregated non-sensitive site snapshot.

Figure 3. Peer A view. Local data remain on-site; only encrypted and masked updates are exchanged, while local non-sensitive aggregates are visualised.

Figure 4. Peer B view. Local data remain on-site; only encrypted and masked updates are exchanged, while local non-sensitive aggregates are visualised.

Figure 5. Encrypted and masked updates submitted by peers. The coordinator decrypts only the sum of masked ciphertexts; masks cancel in aggregate and the global update is applied in the current SMPC-inspired prototype.

Figure 6. Protected message table and protocol timeline. The dashboard records message sizes, encryption time, mask identifiers, and protocol events to support auditability.

Figure 7. Optional LLM Insights interface: evidence-grounded summaries generated from aggregated metrics and site-level aggregates, without using patient-level records.

Figure 8. Final-round Accuracy (ACC) comparison between the centralised baseline (Base) and federated learning (FL) across models (LOGREG vs. LINSVM) and partitions (IID/Non-IID).

Figure 9. Final-round AUC comparison between the centralised baseline (Base) and federated learning (FL) across models (LOGREG vs. LINSVM) and partitions (IID/Non-IID).

Figure 10. Final-round F1-score (F1) comparison between the centralised baseline (Base) and federated learning (FL) across models (LOGREG vs. LINSVM) and partitions (IID/Non-IID).

Figure 11. Round-wise performance curves for heart_disease comparing LOGREG and LINSVM (Base vs. FL). (a) IID, 5 peers. (b) Non-IID, 5 peers, where heterogeneity increases variance and affects the final operating point.

Figure 12. Round-wise performance curves for breast_cancer comparing LOGREG and LINSVM (Base vs. FL). (a) IID, 5 peers, where both models achieve near-ceiling performance. (b) Non-IID, 5 peers, where FL remains stable with only limited degradation.

Table 1. Accuracy (ACC) comparison between centralised training (Base) and MediVault federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. $\Delta$ denotes FL minus Base. “Best” is the maximum ACC achieved across rounds.

Dataset	Model	Part.	Peers	Final Base	Final FL	Final $\Delta$	Best Base	Best FL	Best $\Delta$	Mean ± Std Base	Mean ± Std FL
BC	LINSVM	IID	2	0.982	0.965	−0.018	0.982	0.974	−0.009	0.977 ± 0.007	0.969 ± 0.008
BC	LINSVM	IID	5	0.982	0.974	−0.009	0.982	0.974	−0.009	0.977 ± 0.007	0.972 ± 0.006
BC	LINSVM	Non	2	0.982	0.956	−0.026	0.982	0.965	−0.018	0.977 ± 0.007	0.951 ± 0.013
BC	LINSVM	Non	5	0.982	0.974	−0.009	0.982	0.974	−0.009	0.977 ± 0.007	0.969 ± 0.009
BC	LOGREG	IID	2	0.986	0.986	+0.000	0.986	0.986	+0.000	0.981 ± 0.008	0.976 ± 0.011
BC	LOGREG	IID	5	0.986	0.979	−0.007	0.986	0.979	−0.007	0.981 ± 0.008	0.967 ± 0.012
BC	LOGREG	Non	2	0.986	0.979	−0.007	0.986	0.979	−0.007	0.981 ± 0.008	0.974 ± 0.010
BC	LOGREG	Non	5	0.986	0.979	−0.007	0.986	0.979	−0.007	0.981 ± 0.008	0.965 ± 0.015
HD	LINSVM	IID	2	0.803	0.803	+0.000	0.836	0.820	−0.016	0.812 ± 0.016	0.805 ± 0.006
HD	LINSVM	IID	5	0.803	0.787	−0.016	0.836	0.820	−0.016	0.812 ± 0.016	0.797 ± 0.010
HD	LINSVM	Non	2	0.803	0.836	+0.033	0.836	0.852	+0.016	0.812 ± 0.016	0.833 ± 0.011
HD	LINSVM	Non	5	0.803	0.803	+0.000	0.836	0.820	−0.016	0.812 ± 0.016	0.802 ± 0.008
HD	LOGREG	IID	2	0.853	0.868	+0.015	0.882	0.882	+0.000	0.864 ± 0.008	0.867 ± 0.006
HD	LOGREG	IID	5	0.853	0.868	+0.015	0.882	0.882	+0.000	0.864 ± 0.008	0.869 ± 0.006
HD	LOGREG	Non	2	0.853	0.838	−0.015	0.882	0.882	+0.000	0.864 ± 0.008	0.855 ± 0.018
HD	LOGREG	Non	5	0.853	0.838	−0.015	0.882	0.882	+0.000	0.864 ± 0.008	0.864 ± 0.012

Table 2. Area Under the ROC Curve (AUC) comparison between centralised training (Base) and federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. $\Delta$ denotes FL minus Base. “Best” is the maximum AUC achieved across rounds.

Dataset	Model	Part.	Peers	Final Base	Final FL	Final $\Delta$	Best Base	Best FL	Best $\Delta$	Mean ± Std Base	Mean ± Std FL
BC	LINSVM	IID	2	0.999	0.998	−0.001	0.999	0.998	−0.001	0.999 ± 0.001	0.998 ± 0.001
BC	LINSVM	IID	5	0.999	0.999	−0.001	0.999	0.999	−0.001	0.999 ± 0.001	0.999 ± 0.001
BC	LINSVM	Non	2	0.999	0.998	−0.002	0.999	0.998	−0.001	0.999 ± 0.001	0.998 ± 0.001
BC	LINSVM	Non	5	0.999	0.997	−0.002	0.999	0.999	−0.001	0.999 ± 0.001	0.998 ± 0.002
BC	LOGREG	IID	2	1.000	0.999	−0.001	1.000	0.999	−0.001	1.000 ± 0.000	0.998 ± 0.002
BC	LOGREG	IID	5	1.000	0.998	−0.001	1.000	0.998	−0.001	1.000 ± 0.000	0.996 ± 0.003
BC	LOGREG	Non	2	1.000	0.999	−0.001	1.000	0.999	−0.001	1.000 ± 0.000	0.997 ± 0.002
BC	LOGREG	Non	5	1.000	0.998	−0.001	1.000	0.998	−0.001	1.000 ± 0.000	0.995 ± 0.003
HD	LINSVM	IID	2	0.864	0.881	+0.017	0.915	0.916	+0.001	0.894 ± 0.018	0.903 ± 0.013
HD	LINSVM	IID	5	0.864	0.878	+0.014	0.915	0.916	+0.001	0.894 ± 0.018	0.897 ± 0.012
HD	LINSVM	Non	2	0.864	0.866	+0.002	0.915	0.900	−0.015	0.894 ± 0.018	0.878 ± 0.021
HD	LINSVM	Non	5	0.864	0.881	+0.017	0.915	0.916	+0.001	0.894 ± 0.018	0.900 ± 0.014
HD	LOGREG	IID	2	0.911	0.915	+0.004	0.924	0.926	+0.003	0.912 ± 0.008	0.918 ± 0.004
HD	LOGREG	IID	5	0.911	0.920	+0.009	0.924	0.931	+0.007	0.912 ± 0.008	0.920 ± 0.004
HD	LOGREG	Non	2	0.911	0.917	+0.006	0.924	0.931	+0.007	0.912 ± 0.008	0.917 ± 0.007
HD	LOGREG	Non	5	0.911	0.920	+0.009	0.924	0.931	+0.007	0.912 ± 0.008	0.920 ± 0.004

Table 3. F1-score (F1) comparison between centralised training (Base) and federated learning (FL). BC = breast_cancer; HD = heart_disease; Non = Non-IID. $\Delta$ denotes FL minus Base. “Best” is the maximum F1 achieved across rounds.

Dataset	Model	Part.	Peers	Final Base	Final FL	Final $\Delta$	Best Base	Best FL	Best $\Delta$	Mean ± Std Base	Mean ± Std FL
BC	LINSVM	IID	2	0.986	0.972	−0.014	0.986	0.978	−0.008	0.981 ± 0.007	0.976 ± 0.007
BC	LINSVM	IID	5	0.986	0.979	−0.007	0.986	0.979	−0.007	0.981 ± 0.007	0.978 ± 0.005
BC	LINSVM	Non	2	0.986	0.964	−0.022	0.986	0.970	−0.016	0.981 ± 0.007	0.959 ± 0.012
BC	LINSVM	Non	5	0.986	0.979	−0.007	0.986	0.979	−0.007	0.981 ± 0.007	0.976 ± 0.009
BC	LOGREG	IID	2	0.989	0.989	+0.000	0.989	0.989	+0.000	0.982 ± 0.009	0.981 ± 0.009
BC	LOGREG	IID	5	0.989	0.983	−0.006	0.989	0.983	−0.006	0.982 ± 0.009	0.972 ± 0.010
BC	LOGREG	Non	2	0.989	0.983	−0.006	0.989	0.983	−0.006	0.982 ± 0.009	0.980 ± 0.008
BC	LOGREG	Non	5	0.989	0.983	−0.006	0.989	0.983	−0.006	0.982 ± 0.009	0.971 ± 0.012
HD	LINSVM	IID	2	0.824	0.824	+0.000	0.849	0.836	−0.013	0.825 ± 0.010	0.824 ± 0.006
HD	LINSVM	IID	5	0.824	0.812	−0.012	0.849	0.836	−0.013	0.825 ± 0.010	0.814 ± 0.010
HD	LINSVM	Non	2	0.824	0.848	+0.025	0.849	0.862	+0.013	0.825 ± 0.010	0.845 ± 0.011
HD	LINSVM	Non	5	0.824	0.824	+0.000	0.849	0.836	−0.013	0.825 ± 0.010	0.823 ± 0.008
HD	LOGREG	IID	2	0.839	0.847	+0.009	0.867	0.867	+0.000	0.844 ± 0.009	0.848 ± 0.007
HD	LOGREG	IID	5	0.839	0.852	+0.014	0.867	0.867	+0.000	0.844 ± 0.009	0.851 ± 0.007
HD	LOGREG	Non	2	0.839	0.820	−0.019	0.867	0.867	+0.000	0.844 ± 0.009	0.835 ± 0.019
HD	LOGREG	Non	5	0.839	0.847	+0.009	0.867	0.867	+0.000	0.844 ± 0.009	0.847 ± 0.011

Table 4. Additional evaluation using a lightweight MLP under representative 5-peer settings. BC = breast_cancer; HD = heart_disease; Non = Non-IID. Base Final denotes centralised training performance. FL Final denotes the final federated round, while FL Best denotes the best value achieved across FL rounds.

Dataset	Part.	Peers	Base Final ACC	Base Final AUC	Base Final F1	FL Final ACC	FL Final AUC	FL Final F1	FL Best ACC	FL Best AUC	FL Best F1
BC	IID	5	0.974	0.997	0.979	0.982	0.999	0.986	0.982	0.999	0.986
BC	Non	5	0.974	0.997	0.979	0.974	0.998	0.979	0.982	0.999	0.986
HD	IID	5	0.820	0.891	0.807	0.885	0.944	0.873	0.918	0.958	0.912
HD	Non	5	0.820	0.891	0.807	0.902	0.951	0.897	0.918	0.962	0.912

Table 5. Prototype-level overhead of protected update exchange using Paillier-style additive homomorphic encryption. Dim. denotes update dimensionality. Plain KB and Enc. KB denote plaintext and encrypted payload size per peer. Enc. ms, Agg. ms, and Dec. ms denote mean encryption time per peer, ciphertext aggregation time, and decryption time, respectively. BC = breast_cancer; HD = heart_disease.

Dataset	Model	Peers	Dim.	Plain KB	Enc. KB	Enc. ms	Agg. ms	Dec. ms
BC	LINSVM	5	31	0.24	9.42	259.3	1.12	76.1
BC	LOGREG	5	31	0.24	9.42	258.2	1.07	73.1
BC	MLP	5	1025	8.01	285.28	8679.2	35.66	2490.7
HD	LINSVM	5	15	0.12	4.98	124.7	0.51	35.6
HD	LOGREG	5	15	0.12	4.98	125.3	0.50	36.0
HD	MLP	5	513	4.01	143.14	4367.7	18.60	1236.2
