2. Related Work
The growing awareness of the privacy risks that GIA poses in FL has led to the development of various defence mechanisms. This section reviews existing approaches to mitigating GIA, with an emphasis on differential privacy, homomorphic encryption, secure multi-party computation and masking, chosen for their strong theoretical guarantees, extensive use in recent literature and relevance to the threat model addressed in this work. Although other approaches are emerging, these four techniques represent the dominant lines of defence currently explored and serve as the basis for comparison with the proposed method.
Differential privacy (DP) [14] provides a formal framework for quantifying and preserving the privacy of individual data points within a dataset. In general, DP ensures that the inclusion or exclusion of a single data point does not significantly affect the output of a computation. In FL settings, DP is typically applied by adding noise to the gradients in HFL, or to the intermediate results in VFL and CFL, before they are sent to the server or other participants [15]. The added noise masks the exact contribution of each client’s data, making it much harder for attacks such as GIA to reconstruct sensitive information accurately. For HFL, an adaptive FL framework combining DP mechanisms has been proposed to enhance both the communication efficiency and the privacy preservation of FL systems [16]. This approach integrates an adaptive learning rate strategy to improve model convergence and robustness under varying hyperparameter settings. Experimental results demonstrated that the adaptive strategy not only enhances convergence under limited communication budgets but also improves the model’s resilience to noise across DP budgets. In [17], an adaptive DP mechanism is designed to address two key limitations of DP in VFL: the uniform treatment of features and of privacy budgets. It allocates privacy budgets dynamically according to the impact of each organisation’s features on the global model and adjusts the budget throughout training to balance privacy and utility. Experimental results demonstrated that this method improves privacy by reducing feature inference attacks by 25% and enhances model accuracy by 15% compared to traditional budget allocation methods. DP offers strong privacy protection but introduces a trade-off with model performance: adding noise to gradients or intermediate results can degrade accuracy and slow convergence relative to noise-free training. A balance must therefore be struck, since insufficient noise may expose sensitive data while excessive noise reduces model utility.
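The noise-addition step described above can be sketched as follows. This is a minimal illustration of the common clip-then-add-Gaussian-noise pattern, not the mechanism of [16] or [17]; the function names and the `noise_multiplier` parameter are hypothetical, and no formal privacy budget is computed.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def privatise_gradient(grad, max_norm, noise_multiplier, rng):
    """Clip grad, then add zero-mean Gaussian noise to every coordinate.

    noise_multiplier is a hypothetical knob: larger values add more noise
    (stronger masking of the client's contribution, lower utility).
    """
    clipped = clip_by_l2_norm(grad, max_norm)
    sigma = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]


rng = random.Random(0)
grad = [3.0, 4.0]                      # toy gradient with L2 norm 5
clipped = clip_by_l2_norm(grad, 1.0)   # rescaled to L2 norm 1
noisy = privatise_gradient(grad, 1.0, 0.5, rng)  # what the client would send
```

Raising `noise_multiplier` moves the transmitted vector further from the true gradient, which is the privacy and utility trade-off discussed above.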
Homomorphic encryption (HE) [18] is a form of encryption that allows computations to be performed on encrypted data without decrypting it first. This property ensures that sensitive data remains private during processing: the result of a computation on encrypted data can be decrypted to reveal the final outcome, but the data itself is never exposed in unencrypted form. In the context of FL, HE can be used to protect the privacy of the data being processed [19]. When a client in an FL setup trains a model on its local data, it can encrypt its gradients before sending them to the server. The server can then aggregate the encrypted updates without accessing the actual gradients. Once aggregation is complete, the result is sent back to the client, where it can be decrypted to obtain the final model update. A privacy-preserving FL system based on HE is proposed in [20] to protect shared model parameters. The paper introduced a method that uses a different HE private key for each node within the same FL system, which further enhanced security. The computational and communication costs of the proposed solution are analysed, and the algorithm’s performance is evaluated through simulations in cloud computing-based FL scenarios. The system maintained the privacy of sensitive data by encrypting the gradients. However, implementing HE comes with challenges. Encryption can be computationally intensive, and operations on encrypted data can be much slower than the same operations on plaintext. HE therefore introduces a trade-off between privacy and performance, and its computational overhead requires careful consideration.
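The encrypt, aggregate and decrypt flow described above can be illustrated with a toy additively homomorphic scheme. The sketch below uses a textbook Paillier construction with deliberately tiny primes; it is an assumption made for illustration only, is not the scheme of [20], and is not secure at this key size. The fixed-point factor `SCALE` and all function names are hypothetical.

```python
import math
import random


def keygen(p, q):
    """Return the public modulus n and the private pair (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                               # valid because g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt integer m (negative values are encoded modulo n)."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m % n, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    """Decrypt c and map the result back to a signed integer."""
    lam, mu = priv
    m = ((pow(c, lam, n * n) - 1) // n) * mu % n
    return m - n if m > n // 2 else m


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


rng = random.Random(7)
n, priv = keygen(104729, 1299709)   # toy primes, far too small for real use
SCALE = 1000                        # fixed-point encoding of real gradients

client_a = [0.125, -0.5]
client_b = [0.25, 0.75]
enc_a = [encrypt(n, round(g * SCALE), rng) for g in client_a]
enc_b = [encrypt(n, round(g * SCALE), rng) for g in client_b]

# Server side: aggregation on ciphertexts only, no private key needed.
enc_sum = [add_encrypted(n, ca, cb) for ca, cb in zip(enc_a, enc_b)]

# Client side: decrypt the aggregate and undo the fixed-point scaling.
decrypted = [decrypt(n, priv, c) for c in enc_sum]
aggregate = [d / SCALE for d in decrypted]
```

Every ciphertext operation here is a modular exponentiation or multiplication on integers of roughly twice the key length, which is where the computational and communication overhead noted above comes from.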
Secure multi-party computation (SMPC) [21] is a cryptographic technique that allows multiple parties to collaboratively compute a function over their inputs while keeping those inputs private. In SMPC, no party learns anything about the other parties’ inputs beyond the output of the computation. This is achieved through protocols (including HE) that ensure the privacy of the data and allow computations to be carried out securely despite the participation of potentially untrusted parties. In [22], SMPC-based FL is proposed to improve privacy and efficiency. The algorithm enhanced security by eliminating the need for a trusted third party and mitigating the risk of key disclosure, making it well-suited to FL environments. It also addressed participant dropouts by encrypting model parameters and ensuring secure aggregation. The algorithm used compressed sensing to reduce communication and computation overhead while maintaining strong privacy protection. Experimental results showed that it outperformed existing approaches in efficiency. However, the paper acknowledged that computational efficiency can still be improved and that the local operations required for privacy preservation impose a significant load on participants.
Masking [23] is another technique that can be used to obscure sensitive data, by combining it with random noise or “masks” so that the data is unreadable to unauthorised parties. In FL, the participating clients add carefully constructed random masks to the gradients before transferring them to the server (similar to DP, where random noise is added). The masks help the FL framework resist GIA at the server. The masks used by the clients are generated in such a way that their sum is equal to zero. Therefore, unlike DP, masking causes no loss of accuracy in FL. An efficient secure aggregation protocol based on masking is proposed in [24]. Besides the server, which is responsible for aggregating the gradients, there is a third party whose sole responsibility is to generate random masks and distribute them to the participating clients. The participating clients add these masks to the gradients before sending them to the server. The third party generates fresh random masks in each round of FL training, which could add to the computational and communication complexity of the overall system.
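The zero-sum property that makes masking lossless can be sketched as follows. This is a minimal illustration under stated assumptions: `zero_sum_masks` plays the role of the mask-generating party, the gradient values are toy numbers, and none of the names come from [24].

```python
import random


def zero_sum_masks(num_clients, dim, rng):
    """Return num_clients mask vectors of length dim that sum to zero."""
    masks = [[rng.uniform(-100.0, 100.0) for _ in range(dim)]
             for _ in range(num_clients - 1)]
    # The last mask is the negated sum of the others, so all masks cancel.
    last = [-sum(m[k] for m in masks) for k in range(dim)]
    return masks + [last]


def add_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


rng = random.Random(42)
gradients = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6], [-0.7, 0.8, 0.9]]
masks = zero_sum_masks(len(gradients), 3, rng)

# Each client sends only its masked gradient to the server.
masked = [[g + m for g, m in zip(grad, mask)]
          for grad, mask in zip(gradients, masks)]

aggregate_plain = add_vectors(gradients)
aggregate_masked = add_vectors(masked)   # equal up to floating-point rounding
```

The server sees only `masked`, yet its aggregate matches the unmasked one, which is why masking, unlike DP, does not degrade accuracy.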
Beyond the four techniques discussed above, other approaches have received growing attention in the privacy-preserving FL literature. Split learning with privacy mechanisms [25] partitions the neural network at a cut layer and protects intermediate activations using differential privacy or masking, though it does not directly address the vertical data partitioning setting. Knowledge distillation-based methods such as FedKD [26] and the complementary distillation approach for VFL proposed by Gao et al. [27] transfer knowledge via model outputs rather than gradients, offering a complementary form of privacy protection with a different threat surface. Recent advances in secure VFL include SecureVFL [28], which uses blockchain and replicated secret sharing for decentralised multi-party VFL, and the end-to-end framework of Jin et al. [29], which combines multi-server SMPC with differential privacy to cover both input and output privacy.
The algorithms proposed in this work offer significant advantages over existing privacy-preserving techniques. Unlike DP, they incur no loss in performance, so model accuracy and convergence are maintained. In contrast to HE, they avoid high computational complexity, making them more efficient. Unlike SMPC, the clients only need to communicate directly with the server, eliminating the need for complex communication channels. Finally, unlike masking, no third party is required to generate and distribute masks, which simplifies the process and reduces reliance on external parties.
Table 1 summarises and compares key privacy-preserving techniques in FL across multiple dimensions such as accuracy impact, computational cost, communication overhead and reliance on third parties.
The criteria underlying the qualitative ratings in Table 1 are defined as follows. Accuracy impact is rated None if the technique introduces no degradation relative to centralised training and Medium if a measurable accuracy loss is introduced; DP is rated Medium because noise addition to gradients or intermediate results provably degrades model utility. Computational cost is rated relative to unprotected baseline training: HE is rated High due to the multiplicative overhead of homomorphic operations, while masking, DP and the proposed approach are rated Low. Communication overhead is rated High for HE because encrypted ciphertexts are substantially larger than plaintext tensors; all other techniques transmit values of the same dimensionality as the unprotected baseline and are rated Low. Third-party required is rated Yes only for the masking baseline, which relies on a dedicated external dealer for mask generation and distribution each round; the proposed approach eliminates this requirement by delegating mask generation to the V-client.
4. Privacy Analyses
4.1. Threat Model
We adopt the honest-but-curious (also referred to as semi-honest) adversarial model throughout this work. In this model, every participating party, the server or any client, follows the prescribed protocol faithfully at every step, but may attempt to infer sensitive information from all messages it legitimately receives during protocol execution.
Formally, the view of an honest-but-curious adversary is defined as the complete transcript of values received by that party during a protocol run. In our setting:
The server’s view consists of the masked intermediate matrices and from all clients, and the encrypted gradient matrices .
An H-client ’s view consists of its own local data , its mask , and the encrypted gradient received from the V-client via the server.
The V-client’s view consists of its own local data , the output feature y, all masks , the symmetric keys , and the non-linear activation outputs received from the server.
All privacy claims and analyses presented in this paper are explicitly bounded to this semi-honest model. The extension to a fully malicious adversarial model, where parties may deviate arbitrarily from the protocol, is an important open problem discussed in Section 5.
The semi-honest assumption is well-justified in the federated settings we target. Participants such as hospitals, insurance companies and national health registries are regulated entities subject to strict legal frameworks including HIPAA [34] and GDPR [35]. In such environments, active protocol deviation carries significant legal and reputational consequences, making the assumption of correct protocol execution realistic and well-grounded. The honest-but-curious model is furthermore the standard baseline adopted in the broader federated learning privacy literature [8], and our analysis is consistent with this convention.
The privacy analyses are conducted to demonstrate that the sharing of intermediate information between different participating parties does not leak local data nor can any party infer sensitive information from the other. The analysis assumes an honest-but-curious server (and makes the same assumption about the clients), meaning they follow the prescribed protocol but may attempt to infer sensitive information from the received messages. In this work, the honest-but-curious server or client is referred to as the adversary.
One of the intermediate pieces of information shared with other parties is the Z matrix (from clients to the server and from clients to other clients). A critical privacy concern is whether the adversary can reconstruct the private data , y or from the received Z matrices. Capturing client-side private information from the observed Z matrices is mathematically infeasible.
Suppose that the adversary attempts to infer $X_v$ from the observed $Z_v$. The adversary would need to solve the system:

$$Z_v = W_v X_v + M_v$$

However, the adversary has no access to either $W_v$ or $M_v$, both of which are randomly and independently generated on the client side. The presence of the random mask $M_v$, unknown to the adversary, ensures that $Z_v$ appears random and independent of the true value of $W_v X_v$. Without knowledge of $M_v$, it is mathematically impossible to isolate $W_v X_v$ from $Z_v$, and hence to retrieve any meaningful information about $X_v$.

Moreover, even if the mask were absent, the system $Z_v = W_v X_v$ would remain underdetermined, provided that $W_v$ has fewer rows than columns (i.e., the number of neurons in the first hidden layer is less than the number of input features). Therefore, the system has infinitely many solutions for $X_v$, formally described by:

$$X_v = W_v^{+} Z_v + \left(I - W_v^{+} W_v\right) B \qquad (29)$$

where $W_v^{+}$ denotes the Moore–Penrose pseudoinverse [36] of $W_v$ and $B$ is an arbitrary matrix of appropriate dimensions. This further demonstrates that, even in the absence of the mask, unique recovery of $X_v$ is not possible.

The adversary can use the Moore–Penrose pseudoinverse of $W_v$, denoted by $W_v^{+}$, to obtain a least-squares solution:

$$\hat{X}_v = W_v^{+} Z_v$$

This is a particular solution, and it is the minimum-norm solution to the underdetermined system. However, since there are infinitely many solutions to the equation $Z_v = W_v X_v$, the general solution can be expressed as:

$$X_v = \hat{X}_v + N$$

where $N$ is any solution that lies in the nullspace (kernel) of $W_v$. In other words, it satisfies $W_v N = 0$. The solutions in the nullspace can be written as:

$$N = \left(I - W_v^{+} W_v\right) B$$

Thus, the full general solution is given as Equation (29). An identical argument holds for the $Z_{h_j}$ matrices sent by the H-clients. Therefore, the transmission of the Z matrices by the clients with masking ensures that the private local data remains secure and unrecoverable by the adversary.
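The non-uniqueness argument above can be checked numerically on a toy example. The sketch below is illustrative only: the names `W` (a weight matrix with fewer rows than columns) and `X` (the private data, one column per data point) and all values are assumptions chosen so that the arithmetic is exact. It constructs a second data matrix that differs from the first by a nullspace component and yields the same unmasked product.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def cross(u, v):
    """Cross product; the result is orthogonal to both u and v."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


W = [[1, 2, 3],
     [0, 1, 4]]            # 2 hidden neurons, 3 input features
X = [[5, 1],
     [2, 0],
     [1, 3]]               # 3 features, 2 data points (columns)

# A vector in the nullspace of W: orthogonal to both rows of W.
null_vec = cross(W[0], W[1])

# Shift every column of X by an arbitrary multiple of the nullspace vector.
b = [7, -2]
X_alt = [[X[i][j] + null_vec[i] * b[j] for j in range(2)] for i in range(3)]

Z = matmul(W, X)
Z_alt = matmul(W, X_alt)   # identical to Z although X_alt differs from X
```

Because `b` is arbitrary, an observer of the product alone cannot single out the true data among infinitely many candidates, even before a mask is added.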
The other important intermediate piece of information shared with other parties is . It will be in encrypted form, sent to the server. Without encryption, the server would have access to excessive information: the Z matrices from all clients and the matrix. The server could then modify the matrix, such as setting certain columns of this matrix to zero before sending it to the other clients. These modifications could expose sensitive details which could lead to leaking local data through the Z matrices of the clients in subsequent rounds compromising privacy. The encryption of ensures that the server cannot manipulate or extract critical information preserving the privacy of the local data.
In the above setting, the server only receives the encrypted ; however, the participating clients receive in unencrypted form. Therefore, it is essential to analyse whether a curious client can infer another client’s local data from the received . To infer the original local data from , it would require solving an underdetermined system as depends on both the shared representations and the unknown local model parameters. Further, since aggregates information across mini-batches and depends on the specific way the local model learns, it would be very difficult for a client to reverse it and recover the original data without strong prior knowledge of the target client’s data distribution and model structure.
The server also sends the results of the non-linear activation to the V-client. The V-client as an adversary tries to extract the matrices of the H-clients that could be used later on in conjunction with . The server uses the ReLU activation function, defined as $f(x) = \max(0, x)$, which is a one-way, non-invertible function. However, this non-invertibility is partial: when $f(x) > 0$, the input can be exactly recovered as $x = f(x)$, allowing a potential attacker to reconstruct those parts of the original signal. For an output $f(x)$, if $f(x) = 0$, then $x$ could be any value less than or equal to zero, resulting in an ambiguity: the function loses information about the negative part of the input. Formally, there exists no invertible operation that can uniquely recover $x$ from $f(x)$, as the inverse is undefined for $f(x) = 0$.
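The partial invertibility of ReLU can be seen on a handful of scalar values. The following sketch is illustrative; the input values are arbitrary and `None` is used here to mark entries that an observer of the output cannot recover.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


inputs = [-2.5, -0.1, 0.0, 0.7, 3.0]
outputs = [relu(x) for x in inputs]

# What an observer of the outputs can reconstruct: positive outputs equal
# their inputs exactly, while a zero output only reveals that the input
# was less than or equal to zero.
recovered = [a if a > 0 else None for a in outputs]
```

Three distinct non-positive inputs collapse to the same output, while the two positive inputs are exposed exactly.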
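For stronger privacy, alternative activation functions such as SELU [37] may be considered as they blur information across both positive and negative ranges and do not offer exact inversion on either side. It is defined as:

$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha \left(e^{x} - 1\right), & x \le 0 \end{cases}$$

where $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.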
Another privacy issue that could arise is collusion. Collusion refers to a situation where multiple entities in an FL system (e.g., server and clients) secretly cooperate to infer private data from other parties [38]. It can significantly weaken privacy guarantees, as the combined information from different parties may help uncover masked or intermediate data.
Table 3 summarises the proposed algorithms’ resilience to different types of collusion. For each scenario, we specify the joint view of the colluding coalition and the system of equations they can construct and whether recovery of private data is possible.
4.1.1. Scenario 1: Server Alone Attempts to Infer Private Data (Low Risk)
The server’s view consists of the masked intermediate matrices
and the encrypted gradient matrices
. The server does not have access to any mask
or
, nor to any symmetric key
, nor to any weight matrix
or
. To infer
from
, the server must solve:
However, it has access to neither
nor
. The presence of the unknown random mask
ensures that
is statistically independent of
from the server’s perspective. Even if the mask was absent, the system
is underdetermined whenever
, admitting infinitely many solutions as established in
Section 4. An identical argument applies to each
. Furthermore, the encrypted gradient matrices
are protected by AES encryption under keys unknown to the server, so no information about
is accessible. The server therefore cannot reconstruct any client’s private data.
4.1.2. Scenario 2: Clients Alone Attempt to Infer Each Other’s Data (Low Risk)
Each H-client independently holds its own local data , mask , symmetric key and the decrypted gradient (or in the combined setting). No H-client has access to another H-client’s mask, weight matrix or local data. The V-client holds all masks and keys , but in this scenario is assumed to be honest. For H-client to infer () from the received , it would need to solve an underdetermined system in which depends on the aggregated intermediate representation and on the V-client’s internal model parameters, neither of which is accessible to . Furthermore, aggregates information across all clients and all data points in the mini-batch, making it infeasible to isolate the contribution of any individual client’s data without knowledge of the other clients’ weight matrices and local representations. Each client therefore cannot infer another client’s private data when acting independently.
4.1.3. Scenario 3: V-Client Compromised (High Risk)
The V-client’s view, in addition to its own local data , output feature y and weight matrices includes all masks and all symmetric keys . A compromised V-client can therefore:
Remove masking from H-client transmissions: For each H-client
, the compromised V-client can subtract the known mask
from the observed
:
directly obtaining the weighted data representation
. While this does not immediately yield
, recovery still requires solving an underdetermined system in
and
, both unknown to the V-client, it completely removes the masking layer of protection and exposes the H-client’s data representation to any further inference attack the adversary may mount.
Decrypt all gradient communications: Using the known symmetric keys , the compromised V-client can decrypt any it has forwarded, recovering the plaintext gradient for all H-clients. Combined with knowledge of and the decrypted , the adversary has access to a richer system of equations that may further constrain the recovery of , if the adversary also possesses side knowledge about the data distribution or model initialisation.
This scenario represents a fundamental trust boundary of the proposed framework: the security of all H-clients’ data relies critically on the V-client remaining honest, since the V-client is the sole generator and custodian of all masks and symmetric keys. This limitation is inherent to the design choice of having the V-client manage mask distribution, which was motivated by the goal of eliminating the need for a trusted third party. One mitigation is to distribute the mask generation responsibility to a threshold of clients using a secret sharing scheme [39], so that no single compromised party can reconstruct all masks. This is identified as an important direction for future work.
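To make the suggested mitigation concrete, the sketch below shows a textbook threshold secret sharing construction (polynomial shares over a prime field, reconstructed by Lagrange interpolation). It is an assumption made for illustration and is not part of the proposed framework: the secret stands in for a single mask value or mask seed, and the names `split_secret`, `reconstruct` and `PRIME` are hypothetical.

```python
import random

PRIME = 2**61 - 1   # a Mersenne prime, used as the field modulus


def split_secret(secret, threshold, num_shares, rng):
    """Split secret into num_shares points; any `threshold` of them suffice."""
    # Random polynomial of degree threshold - 1 with constant term = secret.
    coeffs = [secret % PRIME] + [rng.randrange(PRIME)
                                 for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the constant term by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


rng = random.Random(2024)
mask_seed = 123456789                       # stands in for one mask value
shares = split_secret(mask_seed, 3, 5, rng)   # 5 clients, threshold 3

from_first_three = reconstruct(shares[:3])
from_last_three = reconstruct(shares[2:])
```

With a threshold of three, any three share holders can rebuild the value, whereas a single compromised party holds only one point on the polynomial and learns nothing about its constant term.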
4.1.4. Scenario 4: All H-Clients Compromised (High Risk)
If all
J H-clients collude, their joint view includes
. From this joint view, the colluding coalition can compute:
where
is the aggregated matrix computed and broadcast by the server. This gives the colluding coalition direct access to the V-client’s masked intermediate representation
. To recover
from
, the coalition must additionally determine
and
, neither of which is in their view. The system:
remains underdetermined in both
and
simultaneously, with infinitely many joint solutions. However, the coalition also receives
in plaintext, which encodes the gradient of the loss with respect to
and thus carries information about the V-client’s labels
y and its internal model state. Over multiple rounds, the accumulated system of observations,
values and corresponding
values, may provide sufficient constraints to mount a gradient inversion style attack [
8] against the V-client’s data. This scenario therefore constitutes a high privacy risk, particularly under mask reuse (see
Section 3.1.2). The periodic re-masking strategy proposed in
Section 3.1.2 directly mitigates this risk by limiting the number of consistent observations available to the colluding coalition.
4.1.5. Scenario 5: At Least One V-Client and One H-Client Remain Honest (Low Risk)
Suppose the colluding coalition consists of the server and all H-clients except , while the V-client and remain honest. The coalition’s view includes and the server’s aggregated outputs, but does not include or . The coalition knows only the sum , which is computable from the masks they hold. However, knowing the sum does not allow the coalition to determine the individual values of or : there are infinitely many pairs satisfying the same sum, and since the masks are generated as independent random vectors, no pair is more likely than any other from the adversary’s perspective. The system of equations available to the coalition therefore remains underdetermined and the privacy of both the honest V-client and the honest H-client is maintained.
5. Results and Discussions
Our simulations focus on two main evaluation aspects: (i) model performance, measured by accuracy and loss across training rounds, and (ii) computational time, comparing scenarios with and without the encryption and masking mechanisms. The simulations are conducted using the open-source FL tool FLOWER [40] on three publicly available datasets to evaluate the proposed algorithms across diverse data characteristics: Pima Diabetes [41], Skin Segmentation [42] and Patient Care Data [43]. The Pima Diabetes dataset comprises 768 data points with 8 numerical features, where the task is to predict binary diabetes status (500 negative, 268 positive). The Skin Segmentation dataset includes 245,057 data points with 3 features to distinguish between skin and non-skin pixels (50,859 skin, 194,198 non-skin). The Patient Care Data consists of 3309 data points and 9 features representing a binary classification task between in-care and out-of-care patients (1992 in-care, 1317 out-of-care). Each dataset undergoes normalisation as a pre-processing step. The simulation code and experimental details are available on the GitHub page (https://github.com/AustralianCancerDataNetwork/FlowerSimulations/tree/main/SecureFlowerSimulations (accessed on 28 May 2026)).
Although all three datasets involve binary classification tasks, they differ substantially in scale, class balance and feature dimensionality, providing a meaningful range of evaluation conditions. The Pima dataset represents a small-scale, class-imbalanced medical setting (500 negative, 268 positive samples). The Patient Care dataset represents a moderate-scale clinical classification task with a more balanced class distribution (1992 vs. 1317 samples). The Skin Segmentation dataset, at 245,057 samples, is substantially larger and primarily evaluates the scalability of the proposed algorithms. Together, the three datasets span more than two orders of magnitude in size (768 to 245,057 data points) and provide a reasonable empirical basis for the tabular binary classification setting. All empirical claims in this paper are explicitly bounded to this setting.
Table 4 outlines the distribution of input features, data points, and output features across clients for the three proposed secure FL algorithms applied to the Pima, Patient, and Skin datasets. One V-client and two H-clients participated in the simulations. In the Secure VFL setup, features were vertically partitioned across clients, with all data points accessible to each client. Only the V-client held the output feature, and the H-clients contributed complementary subsets of input features. In the Secure V-OutFed configuration, the V-client again possessed the output feature and full data access, but the H-clients held non-overlapping subsets of the data points, introducing an unbalanced horizontal partitioning. Secure H-OutFed used the same data partitioning as V-OutFed; however, in this case, the output feature was distributed to the H-clients instead of the V-client. Across all datasets and algorithmic settings, the machine learning model remained consistent: a neural network with two hidden layers, each comprising five neurons and sigmoid activation functions. Training was conducted over 100 rounds using a fixed learning rate of 0.5.
The configuration used throughout this work (two hidden layers with five neurons each, sigmoid activations, a learning rate of 0.5 and 100 training rounds) was selected through preliminary experimentation conducted prior to the main evaluation. Several configurations were tested on all three datasets; the chosen architecture was the most conservative configuration that achieved stable convergence across all datasets and all three algorithms within a reasonable training budget. Larger architectures did not yield meaningfully different convergence behaviour on these relatively low-dimensional tabular datasets, while smaller architectures showed instability on the larger Skin Segmentation dataset. The chosen configuration therefore represents a stable, dataset-appropriate baseline for the comparative evaluation conducted in this work.
It is important to note that for Secure VFL and Secure V-OutFed, the reported performance is architecture-independent by construction. As established in Section 3, both algorithms produce mathematically identical outputs to centralised training given the same initialisation, regardless of the architecture used. This follows from the algebraic structure of the intermediate representation exchange: the masks cancel exactly in the aggregated matrix and gradient propagation proceeds identically to the centralised case. Consequently, any architecture that converges under centralised training will yield identical results under Secure VFL and Secure V-OutFed. For Secure H-OutFed, where a small performance gap exists, the competitive results observed across three datasets spanning more than two orders of magnitude in size suggest that this gap is not an artefact of the specific architecture chosen. The absence of a formal architectural study for Secure H-OutFed is acknowledged as a limitation and is identified as a direction for future work.
The absence of a systematic sensitivity analysis of the training hyperparameters (learning rate, number of neurons, number of training rounds) is acknowledged as a limitation of the current experimental evaluation. For Secure VFL and Secure V-OutFed, such an analysis is not required, as these algorithms are mathematically equivalent to centralised training under any hyperparameter configuration and sensitivity behaviour transfers directly from the centralised baseline. For Secure H-OutFed, a formal sensitivity analysis is identified as a direction for future work.
Figure 5a–f illustrate the accuracy and loss for the Pima, Skin, and Patient datasets across the proposed secure FL algorithms and a centralised baseline. The performance of Secure VFL and Secure V-OutFed exactly matches that of the centralised model, demonstrating no loss in accuracy or increase in loss despite distributed data and added privacy mechanisms. Secure H-OutFed also achieves competitive performance, with slightly higher loss and marginally lower accuracy. The presented results correspond to a specific data distribution scenario as detailed in Table 4. To evaluate the robustness of the proposed algorithms, additional simulations under a range of data distribution settings, including independent and identically distributed (IID) and non-IID partitions, were conducted in prior work [7]. Readers are referred to that study for comprehensive analysis across these variations; results are not repeated here to avoid redundancy.
The proposed secure VFL and secure V-OutFed algorithms demonstrate practical scalability in computation and communication. The computation per H-client involves a local linear transformation of complexity for secure VFL and for secure V-OutFed. The V-client performs more intensive operations: it computes its local representation with cost , followed by forward propagation, loss computation, and backward propagation. If the total number of parameters in the computation layers is p, the overall cost at the V-client becomes . On the server side, aggregation of masked intermediate outputs and relaying gradients involves a cost of , linear in the number of clients J. Communication per client per round is also linear in and m. Therefore, though the framework scales linearly with data dimensions and client count, the V-client and server may experience computational or memory bottlenecks as the network or dataset grows substantially.
The proposed secure H-OutFed algorithm also exhibits practical scalability in computation and communication. The V-client handles its input features locally and transmits masked intermediate outputs to the server. Its computation primarily consists of a local linear transformation with complexity . Each H-client undertakes more computationally intensive tasks including computing local representations at a cost of , followed by forward propagation, loss evaluation and backward propagation. The overall computational complexity at an H-client is given by . On the server side, aggregation of masked gradients or local model parameters received from the H-clients involves a computational cost of . This cost linearly increases with the number of clients and the model size. Communication per client per round is also linear in p, as each client transmits masked gradients or parameters of the same dimensionality. However, similar to the other two algorithms, the server in secure H-OutFed may face computational or memory bottlenecks for large models or high client counts.
Table 5 presents the computation time (in seconds) required by the proposed secure federated learning algorithms across three datasets of varying sizes: Pima (768 data points), Patient (3309 data points) and Skin (245,057 data points). The experiments were conducted on a local Windows 11 machine equipped with a 13th Gen Intel(R) Core(TM) i7-13700H processor (2.90 GHz) and 16 GB of RAM. As expected, the computation time increases with the dataset size and the level of security applied. In the experiments reported in Table 5, the combined use of encryption and masking increased runtime by approximately 3.5% to 43.9% relative to the unprotected baseline, depending on the dataset and algorithm. The percentage overhead was largest for the smaller Pima dataset, where fixed encryption and protocol costs form a larger share of total runtime, and smallest for the larger Skin dataset, where the total computation is dominated by model training. For all three algorithms, the base computation time (i.e., without encryption or masking) is the lowest and the combined use of encryption and masking incurs the highest time cost. In addition, secure H-OutFed generally exhibits slightly higher computational overhead due to its more complex interaction structure. Enabling masking or encryption individually increases the runtime only moderately; their combined use results in a more noticeable overhead. Despite this, the additional computation cost remains practical even for large datasets like Skin. Note that these values may vary slightly across runs due to randomness in model initialisation and other system-level factors.
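The relative overhead discussed above is the percentage increase of the secured runtime over the unprotected baseline. The following Python sketch illustrates why a roughly fixed encryption and protocol cost yields a larger percentage overhead on a short run than on a long one. The function name and all timings are hypothetical and chosen for illustration only; they are not the values reported in Table 5.

```python
def relative_overhead(base_seconds: float, secure_seconds: float) -> float:
    """Percentage increase in runtime of a secured run over its baseline."""
    if base_seconds <= 0:
        raise ValueError("base_seconds must be positive")
    return 100.0 * (secure_seconds - base_seconds) / base_seconds


# Hypothetical timings in seconds (NOT taken from Table 5).
FIXED_COST = 2.0        # assumed fixed encryption/protocol cost per run
SMALL_BASE = 10.0       # assumed baseline for a small dataset
LARGE_BASE = 500.0      # assumed baseline for a large dataset

small = relative_overhead(SMALL_BASE, SMALL_BASE + FIXED_COST)
large = relative_overhead(LARGE_BASE, LARGE_BASE + FIXED_COST)

print(f"small dataset overhead: {small:.1f}%")
print(f"large dataset overhead: {large:.1f}%")
```

Under these assumed numbers the same absolute cost is a far larger share of the short run, which is consistent with the pattern described for the Pima and Skin datasets.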
Though the proposed secure VFL and CFL algorithms demonstrate promising results in controlled local simulations, deploying them in real-world environments presents several additional challenges. Specifically, the VFL framework without encryption and masking has previously been implemented on the NECTAR cloud platform [44], with the server hosted on NECTAR and clients distributed across different physical locations. In such distributed deployments, communication overhead becomes significantly more pronounced than in local machine setups. A key bottleneck lies in the transmission of the intermediate representation matrix Z, whose column size scales with the number of data points. For large-scale datasets, Z could become exceptionally large, making its secure transmission over the network computationally and logistically intensive. The same issue could arise in the proposed CFL algorithms (e.g., V-OutFed and H-OutFed), where multiple parties share and aggregate masked or encrypted outputs and gradients. One possible way to tackle this is to transmit compressed representations of the matrix Z or to aggregate partial summaries instead of full outputs.
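To give a sense of scale, the following Python sketch estimates the raw payload of Z for the Skin dataset under stated assumptions. It assumes Z is a dense matrix with one column per data point and a hypothetical 64 rows; the row count, the value widths and the mini-batch size are illustrative assumptions, not settings taken from the experiments. Ciphertext expansion from encryption is not modelled and would increase these figures.

```python
def payload_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Raw size in bytes of a dense rows x cols matrix, before any encryption."""
    return rows * cols * bytes_per_value


N_POINTS = 245_057   # Skin dataset size, as stated in the text
HIDDEN_UNITS = 64    # hypothetical number of rows of Z
BATCH_SIZE = 256     # hypothetical mini-batch size

full = payload_bytes(HIDDEN_UNITS, N_POINTS, 8)     # 64-bit floats
half = payload_bytes(HIDDEN_UNITS, N_POINTS, 2)     # 16-bit floats (lossy)
batch = payload_bytes(HIDDEN_UNITS, BATCH_SIZE, 8)  # one mini-batch of columns

MIB = 1024 * 1024
print(f"full Z, 64-bit values: {full / MIB:.1f} MiB")
print(f"full Z, 16-bit values: {half / MIB:.1f} MiB")
print(f"one mini-batch, 64-bit values: {batch / MIB:.3f} MiB")
```

Reducing numerical precision is only one possible form of compression and is lossy; whether it is compatible with the masking and encryption steps of the proposed algorithms would need to be verified separately.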
The correctness of the proposed algorithms holds mathematically for any data type and any differentiable model architecture, since it depends only on the algebraic structure of the intermediate representation exchange and not on the specific form of the input. In this sense, the framework is architecturally general. However, extending it to imaging data in the VFL (and CFL) setting raises non-trivial design challenges. Since VFL partitions data by feature rather than by sample, applying the framework to image data would require splitting images by pixel regions across clients. While mathematically valid, this does not correspond to a realistic VFL scenario: institutions holding imaging data are far more likely to possess entirely different images than disjoint pixel subsets of the same images, which is a horizontal rather than a vertical partitioning scenario. A more realistic imaging-compatible VFL scenario would involve a multimodal dataset in which imaging features (e.g., medical scans) held by one client are paired with tabular clinical features held by another client for the same patients. Such an extension would also require replacing the fully connected neural network with an architecture combining convolutional and fully connected layers to handle spatial structure, which would constitute a separate model design contribution. Designing and evaluating the proposed framework in such a multimodal VFL setting is identified as an important direction for future work.
The proposed algorithms demonstrate strong performance in heterogeneous data settings as reflected in their accuracy and loss outcomes discussed earlier. However, practical scalability in such settings also depends on the heterogeneous computational and communication capacities of the participating entities. In secure VFL and secure V-OutFed, the V-client bears the majority of the computational burden: handling forward propagation, loss computation and backpropagation. Therefore, it is important that the V-client possesses sufficient computational power and network bandwidth to prevent bottlenecks. On the other hand, in secure H-OutFed, each H-client is actively involved in local model training and must have adequate resources to remain a feasible participant in training.
This work assumes an honest-but-curious server: one that correctly follows the protocol but may attempt to infer sensitive information. If the server is instead malicious, additional risks arise. A malicious server could drop or block updates from certain clients, compromising their participation. Further, it could send incorrect or inconsistent gradients or updates to different clients, leading to model divergence or performance degradation [45]. To mitigate the impact of a potentially malicious server, mechanisms such as verifiable computation or zero-knowledge proofs [46] can be used, which allow clients to verify that the server is executing the intended operations without tampering with the updates. Another strategy for handling an untrusted or potentially malicious server is to run critical computations within a Trusted Execution Environment (TEE), such as Intel SGX [47]. TEEs help maintain integrity and confidentiality even when the underlying system cannot be fully trusted.
The current work focuses on tabular data, but the same foundational idea, in which clients compute and share intermediate results, can in principle be extended to other data types, such as images, subject to the design challenges noted above. The underlying framework built around neural networks is also flexible enough to accommodate simpler models like logistic regression. However, it remains to be explored how well this approach generalises to other machine learning models such as decision trees or random forests, which have different computational and structural characteristics. Moreover, exploring unsupervised machine learning models could offer new opportunities and challenges for secure and collaborative model training in future work.