A Federated Generalized Linear Model for Privacy-Preserving Analysis

Abstract: In the last few years, federated learning (FL) has emerged as a novel alternative for analyzing data spread across different parties without needing to centralize them. To increase the adoption of FL, more algorithms need to be developed that can be deployed under this privacy-preserving paradigm. In this paper, we present our federated generalized linear model (GLM) for horizontally partitioned data. It allows generating models of different families (linear, Poisson, logistic) without disclosing privacy-sensitive individual records. We describe its algorithm (which can be implemented in the user's platform of choice) and compare the obtained federated models against their centralized counterparts, to which they are mathematically equivalent. We also validated their execution time with increasing numbers of records and involved parties. We show that our federated GLM is accurate enough to be used for the privacy-preserving analysis of horizontally partitioned data in real-life scenarios. Further development of this type of algorithm has the potential to make FL a much more common practice among researchers.


Introduction

Federated Learning
In the past few years, there has been a surge in the amount of potential data available for gathering [1]. Machine learning (ML) and artificial intelligence (AI) have leveraged this and have allowed the development of numerous analytic tools across many different industries, such as healthcare, banking, and manufacturing, among many others [2][3][4][5].
Due to the inherent complex nature of systems and processes, data are usually fragmented in silos across parties. For example, parties could have collected different data features for the same group of individuals (i.e., vertically partitioned data). Alternatively, parties could have gathered the same features but for a different group of individuals (i.e., horizontally partitioned data) [6], the latter being the focus of this work.
In the traditional way of working, data fragmentation is solved by pooling the data from all parties into a single location. In other words, each party generates a copy of its data. Then, these copies are brought together and centralized by a trusted party, which then proceeds with analyzing the data and obtaining a global model. Unfortunately, this centralized approach is undesirable due to several operational, organizational, and political challenges. For example, once a copy of the data is created and shared outside its point of origin, it is very hard to keep control of it. Sometimes, the costs of integrating huge amounts of data scattered across different parties can make centralization infeasible [7]. More importantly, there are increasing privacy and security concerns and requirements that make merging the data in a single point infeasible. Users are increasingly concerned that their private data are being used for commercial or political purposes without their consent [8]. Moreover, regulatory bodies all around the world have started implementing laws that regulate responsible data management and use, such as the California Consumer Privacy Act (CCPA, 2020) in California, USA and the General Data Protection Regulation (GDPR, 2018) in Europe, among many others [9][10][11].
Federated learning (FL) has emerged as an alternative paradigm to overcome these shortcomings [12]. In this approach, the process of generating the global model is distributed among the parties. Instead of sharing their data, the involved parties perform computations on them, generating aggregated statistics (often encrypted) that are then shared to generate the global model. This keeps the original data undisclosed and safe at their original location, greatly reducing the risk of leaking privacy-sensitive information while generating global models that are very close (if not practically identical) to their centralized counterparts [13].

Generalized Linear Models
The term generalized linear model (GLM) refers to a large class of models popularized by McCullagh and Nelder [24]. In these models, the response variable $y_i$ is assumed to follow an exponential family distribution with mean $\mu_i$, which is assumed to be some (often non-linear) function of $x_i^T \beta$. There are three components to any GLM. First, there is the random component, which describes the probability distribution of the response variable $y$. We will consider only cases in which the observations come from a distribution in the exponential family with probability density function as given by Equation (1):

$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \qquad (1)$$

Here, $\theta$ is the canonical parameter (such that $E(y) = \mu = b'(\theta)$ and $\mathrm{Var}(y) = a(\phi)\, b''(\theta)$). It is straightforward to show that the canonical parameter for $y \sim N(\mu, \sigma^2)$ is $\theta = \mu$, and the canonical parameter for $y \sim \mathrm{Bin}(n, \pi)$ is $\theta = \mathrm{logit}(\pi) = \log \frac{\pi}{1 - \pi}$. Secondly, there is a systematic component, which defines how the linear combination of the explanatory variables $x = (x_1, x_2, \ldots, x_k)$ defines the linear predictor in Equation (2), where $\beta$ must be estimated:

$$\eta = x^T \beta = \sum_{j=1}^{k} x_j \beta_j \qquad (2)$$

Lastly, there is a link function $g(\cdot)$, which specifies the link between the random and the systematic components (depending on how the mean function is expressed). The most commonly used link function for a normal model is $\eta = \mu$, while for a binomial model, it is $\eta = \mathrm{logit}(\pi)$. Note that whenever $\eta = g(\mu) = \theta$, we say that the model has a canonical link.
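To make the three components concrete, the following is a minimal Python sketch (not the published R implementation) that encodes the three families used later in the paper by their canonical link, inverse link, and variance function. The names `FAMILIES`, `linear_predictor`, and `mean_response` are hypothetical and chosen only for illustration.

```python
import math

# Each family is described by its canonical link g, the inverse link g^{-1},
# and the variance function V(mu) = b''(theta) written in terms of mu.
# The dictionary layout and names are illustrative assumptions.
FAMILIES = {
    "gaussian": {
        "link": lambda mu: mu,                         # identity link
        "inverse_link": lambda eta: eta,
        "variance": lambda mu: 1.0,
    },
    "poisson": {
        "link": lambda mu: math.log(mu),               # log link
        "inverse_link": lambda eta: math.exp(eta),
        "variance": lambda mu: mu,
    },
    "binomial": {
        "link": lambda mu: math.log(mu / (1.0 - mu)),  # logit link
        "inverse_link": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
        "variance": lambda mu: mu * (1.0 - mu),
    },
}


def linear_predictor(x, beta):
    """Systematic component: eta = x^T beta."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def mean_response(family, x, beta):
    """Map the linear predictor to the mean through the inverse link."""
    return FAMILIES[family]["inverse_link"](linear_predictor(x, beta))
```

With all coefficients at zero, `mean_response("binomial", x, beta)` returns 0.5 for any `x`, which is the usual starting point of a logistic fit.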

Estimation of a Centralized GLM
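In order to estimate a GLM, we need to calculate the maximum likelihood estimate (MLE) of $\beta$. Using the canonical link $\eta = \theta$, the log-likelihood can be written as in Equation (3):

$$l(\beta) = \sum_{i=1}^{n} \left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right], \qquad \theta_i = \eta_i = x_i^T \beta \qquad (3)$$

To find the MLE, we use Fisher's scoring algorithm, for which the generic $(t+1)$-th step can be calculated using Equation (4):

$$\beta^{(t+1)} = \beta^{(t)} - \left\{ E\left[ l''(\beta^{(t)}) \right] \right\}^{-1} l'(\beta^{(t)}) \qquad (4)$$

where $l'$ and $l''$ are the first and second derivatives of the log-likelihood, whose elements are given by Equation (5) and (in expectation) Equation (6), respectively:

$$l'_j = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{a(\phi)\, V(\mu_i)} \frac{\partial \mu_i}{\partial \eta_i}\, x_{ij} \qquad (5)$$

$$E\left[ l''_{jk} \right] = E\left[ \frac{\partial^2 l}{\partial \beta_j \partial \beta_k} \right] = -\sum_{i=1}^{n} \frac{x_{ij}\, x_{ik}}{a(\phi)\, V(\mu_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \qquad (6)$$

where $x_{ij}$ and $x_{ik}$ are the $j$-th and $k$-th elements of the covariate vector for the $i$-th observation, and $V(\mu_i) = b''(\theta_i)$ is the variance function.
Using algebra and matrix notation, we can rewrite them as Equations (7) and (8):

$$l' = X^T A\, (y - \mu) \qquad (7)$$

$$E\left[ l'' \right] = -X^T W X \qquad (8)$$

where $W = \mathrm{diag}\left( \frac{1}{a(\phi)\, V(\mu_i)} \left( \frac{\partial \mu_i}{\partial \eta_i} \right)^2 \right)$ and $A = W \frac{\partial \eta}{\partial \mu}$. With this, the Fisher scoring iteration of Equation (4) can be rewritten as Equation (9):

$$\beta^{(t+1)} = \beta^{(t)} + \left( X^T W X \right)^{-1} X^T A\, (y - \mu) \qquad (9)$$

and considering that $X\beta = \eta$ and $A = W \frac{\partial \eta}{\partial \mu}$, we can rewrite it as Equation (10):

$$\beta^{(t+1)} = \left( X^T W X \right)^{-1} X^T W z \qquad (10)$$

where $z = \eta + \frac{\partial \eta}{\partial \mu} (y - \mu)$. This way, Fisher scoring can be regarded as Iteratively Reweighted Least Squares (IRWLS) carried out on a transformed version of the response variable. The IRWLS algorithm can be described as in Algorithm 1, where $g(\cdot)$ is the link function, $\Delta g = \frac{\partial \mu}{\partial \eta}$ is the derivative of the inverse link function $g^{-1}(\cdot)$ with respect to the linear predictor, and $w = w_1, \ldots, w_n$ are arbitrary weights assigned to the units (which are equal to 1 by default).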
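As an illustration of the IRWLS iteration described above, the following is a minimal pure-Python sketch. It is not the authors' R code: the helper names (`solve`, `irls`), the zero starting value for the coefficients, the unit prior weights, and the small Poisson dataset are all assumptions made for this example. Each iteration builds the working response z and the weights, then solves the weighted least-squares system of Equation (10).

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def irls(X, y, inv_link, dmu_deta, variance, tol=1e-8, max_iter=50):
    """Fit a GLM by IRWLS, starting from beta = 0 (an assumption)."""
    K = len(X[0])
    beta = [0.0] * K
    for _ in range(max_iter):
        XtWX = [[0.0] * K for _ in range(K)]
        XtWz = [0.0] * K
        for xi, yi in zip(X, y):
            eta = sum(a * b for a, b in zip(xi, beta))
            mu = inv_link(eta)
            d = dmu_deta(eta)              # derivative of the inverse link
            z = eta + (yi - mu) / d        # working (adjusted) response
            w = d * d / variance(mu)       # working weight, prior weight = 1
            for j in range(K):
                XtWz[j] += xi[j] * w * z
                for k in range(K):
                    XtWX[j][k] += xi[j] * w * xi[k]
        new_beta = solve(XtWX, XtWz)
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Toy Poisson regression with a log link: intercept plus one covariate.
X = [[1.0, float(x)] for x in range(8)]
y = [1, 0, 2, 1, 3, 2, 4, 5]
beta = irls(X, y, inv_link=math.exp, dmu_deta=math.exp,
            variance=lambda mu: mu)
```

At convergence the score vector, the sum over observations of x_ij (y_i − µ_i), should be numerically zero, which is a convenient check that the iteration reached the MLE.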
In this paper, we present our federated implementation of a GLM for horizontally partitioned data. The manuscript is organized as follows. Section 2 describes the algorithm of the presented federated GLM in detail, how it was implemented, and its validation process for both accuracy and execution time. These results are shown in Section 3, where they are also discussed in detail. We also show examples where our federated GLM is being used as well as discuss possible improvements for it. Section 4 closes the paper with our overall conclusions.

Materials and Methods
We assume that all the involved parties have previously agreed on what variables will be used as an input to the model as well as the variable to be predicted, and they have consistently harmonized their data.

Setup
Our presented federated GLM algorithm is designed to run on a server-client architecture [6], as shown in Figure 1. In this scenario, (1) the user sends a task (i.e., the instruction to run a specific algorithm) to the server. When received, (2) the server manages and synchronizes the execution of the algorithm across all nodes (i.e., parties). In short, (3) each node accesses its own local data and executes the requested algorithm with the given parameters. Afterwards, the node outputs a set of intermediate results (often in the form of preliminary coefficients), which are sent back to the server. Then, using these, (4) the server computes a first version of the global solution and sends it back to the nodes, which use it to compute a new set of results. This process is repeated iteratively until (5) the global solution converges or after a fixed number of iterations, yielding the final version of the (federated) model.

Algorithm for a Federated GLM
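Here, we describe the algorithm of a federated version of the GLM. The main idea behind it is that the components of Equation (10) can be (partially) computed at each party $m$ and put together afterwards without ever bringing the data together.
Let us consider $M \geq 2$ parties (e.g., data registries, hospitals, banks, etc.) holding an exclusive partition of the full dataset. Let us denote by $n_m$ the number of observations in the $m$-th data source such that the total sample size of the study is given by $n = \sum_{m=1}^{M} n_m$. The first and the expected second derivatives of Equations (5) and (6) can be rewritten as in Equations (11) and (12), respectively (with $V(\cdot)$ the variance function):

$$l'_j = \sum_{m=1}^{M} \sum_{i=1}^{n_m} \frac{y_i^{(m)} - \mu_i^{(m)}}{a(\phi)\, V(\mu_i^{(m)})} \frac{\partial \mu_i^{(m)}}{\partial \eta_i^{(m)}}\, x_{ij}^{(m)} \qquad (11)$$

$$E\left[ l''_{jk} \right] = -\sum_{m=1}^{M} \sum_{i=1}^{n_m} \frac{x_{ij}^{(m)}\, x_{ik}^{(m)}}{a(\phi)\, V(\mu_i^{(m)})} \left( \frac{\partial \mu_i^{(m)}}{\partial \eta_i^{(m)}} \right)^2 \qquad (12)$$

Therefore, using the matrix form, we can rewrite Equation (10) as Equation (13):

$$\beta^{(t+1)} = \left( \sum_{m=1}^{M} X^{(m)T} W^{(m)} X^{(m)} \right)^{-1} \sum_{m=1}^{M} X^{(m)T} W^{(m)} z^{(m)} \qquad (13)$$

where $X^{(m)}$ is the $(n_m \times K)$ matrix of covariates for party $m$, $W^{(m)}$ is the corresponding $(n_m \times n_m)$ diagonal matrix of weights, and $z^{(m)}$ is the $n_m$-vector of the adjusted dependent variable of the $m$-th partition. The federated algorithm is described in Algorithm 2. Note that the proposed algorithm is mathematically equivalent to the one used for centralized data (Section 1.2.1), but it does not require the data to be pooled together.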
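The aggregation idea in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the published R implementation or the vantage6 API: `node_summaries`, `federated_glm`, the simulated logistic data, and the zero starting value are all assumptions. Each simulated node returns only its K×K matrix and K-vector of weighted sums; the server adds them up and solves for the new coefficients.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def node_summaries(X, y, beta):
    """Runs at one party: local X^T W X and X^T W z (logistic, logit link)."""
    K = len(beta)
    XtWX = [[0.0] * K for _ in range(K)]
    XtWz = [0.0] * K
    for xi, yi in zip(X, y):
        eta = sum(a * b for a, b in zip(xi, beta))
        mu = 1.0 / (1.0 + math.exp(-eta))
        w = mu * (1.0 - mu)           # canonical link: weight = variance
        z = eta + (yi - mu) / w       # adjusted dependent variable
        for j in range(K):
            XtWz[j] += xi[j] * w * z
            for k in range(K):
                XtWX[j][k] += xi[j] * w * xi[k]
    return XtWX, XtWz


def federated_glm(parties, K, tol=1e-8, max_iter=50):
    """Runs at the server: sum the node summaries and update beta."""
    beta = [0.0] * K
    for _ in range(max_iter):
        parts = [node_summaries(X, y, beta) for X, y in parties]
        XtWX = [[sum(p[0][j][k] for p in parts) for k in range(K)]
                for j in range(K)]
        XtWz = [sum(p[1][j] for p in parts) for j in range(K)]
        new_beta = solve(XtWX, XtWz)
        done = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if done:
            break
    return beta


# Simulated data (assumed coefficients), split into three parties of 100.
rng = random.Random(0)
X = [[1.0, rng.gauss(0, 1), rng.gauss(2, 1)] for _ in range(300)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(-1.0 + 0.8 * x[1] + 0.3 * x[2])))
     else 0 for x in X]
parties = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 300, 100)]
beta_federated = federated_glm(parties, K=3)
beta_pooled = federated_glm([(X, y)], K=3)   # same data held by one party
```

Because the per-party sums add up to the pooled sums, `beta_federated` and `beta_pooled` should agree up to floating-point rounding, while no individual record leaves its node in the federated run.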
The federated GLM was implemented in R v. 4.1.3 [25]. Its code as well as the code used for its validation (Section 2.3) is publicly available in our GitHub repository (accessed 14 June 2022).

Validation
We validated two important aspects of our federated GLM: accuracy and execution time. For both cases, we used artificial data generated using Python v. 3.8 [26]. The rest of the validation was completed using R v. 4.1.3 [25] and was performed on a laptop running Windows 10® (64 bit) with an Intel® Core i7 CPU running at 1.8 GHz and 32 GB of RAM.

Accuracy
In order to demonstrate the accuracy of our federated GLM, we generated three models, each from a different family (with its corresponding link function) and with an appropriate dataset. These are summarized as follows.

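Algorithm 2. Federated GLM: the server initializes the coefficients; each node computes its local quantities (including the weights, which depend on $\mathrm{Var}(\mu^{(m)})$) and returns them to the server; the server updates the coefficients and returns them to the nodes.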

Linear Regression
This model assumes a Gaussian distribution of the error and uses an identity link function. Notice that this particular case corresponds to that of a general linear model (not to be confused with a GLM; a general linear model is a specific case of a GLM). The target variable $y$ was generated according to Equation (14), where $x_2 \sim \mathcal{N}(2, 1)$ (16).

Poisson Regression
This model assumes a Poisson distribution of the error and uses a log link function. The target variable $y$ was generated according to Equation (18):

$$y = \mathrm{round}\left( e^{0.25 x_1 + 0.5 x_2 + \zeta} \right) \qquad (18)$$

where $x_1$, $x_2$, and $\zeta$ were defined the same way as in the previous model (Equation (15), Equation (16), and Equation (17), respectively).

Logistic Regression
This model assumes a binomial distribution and uses a logit link function. In this case, the data were generated using scikit-learn v. 1.0.2 [27]. They consisted of normally distributed clusters of points (with a standard deviation of 1) around vertices of a 2D plane (since we chose to use two features, $x_1$ and $x_2$) [28].
In all cases validating the federated GLM's accuracy, the data comprised 3000 records, which were randomly split into three simulated parties with 1000 records each. The federated GLM was run using a mock server-client architecture. This was available from our previously developed open-source priVAcy preserviNg federaTed leArninG infrastructurE for Secure Insight Exchange (vantage6) [29]. In short, it recreated a server and the nodes of all three parties locally (which is sufficient for verifying the mathematical implementation of the algorithm), making the development and testing very practical. We set the algorithm to stop its execution if it converged (i.e., if the difference between the coefficients across iterations was smaller than $1 \times 10^{-8}$) or if it reached 50 iterations (the latter condition was never reached; the algorithm converged in all the presented cases).
Afterwards, we compared the federated GLM output to its centralized counterpart, which can be considered the gold standard. Here, the data were all put together as if they were all available to a single party. Then, a (centralized) GLM was generated using R's glm function from the stats package [25]. Specifically, we compared the federated and centralized models' coefficients, standard errors, p-values, and z-values. Tables were generated using the package stargazer v. 5.2.3 [30].
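The compared quantities other than the coefficients can also be obtained from aggregated information only. The following Python sketch shows one standard way to do so, which is an assumption here since the text does not spell out how they were computed: with the dispersion fixed at 1 (as for the Poisson and binomial families), the covariance of the estimates is the inverse of the summed information matrix, from which Wald z-values and two-sided p-values follow. The function names `invert` and `wald_summary` are hypothetical.

```python
import math


def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def wald_summary(beta, XtWX):
    """Return (estimate, standard error, z-value, p-value) per coefficient.

    Assumes dispersion = 1, so that Cov(beta) = (X^T W X)^{-1}.
    """
    cov = invert(XtWX)
    rows = []
    for j, b in enumerate(beta):
        se = math.sqrt(cov[j][j])
        z = b / se
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        rows.append((b, se, z, p))
    return rows


# Toy input: a diagonal information matrix gives standard errors 0.5 and 0.2.
summary = wald_summary([1.0, 0.0], [[4.0, 0.0], [0.0, 25.0]])
```

For a Gaussian model the dispersion would have to be estimated as well, which this sketch does not cover.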

Execution Time
We were also interested in validating how the federated GLM's execution time scales under different circumstances. For this purpose, we ran numerous simulations. All of them used the same mock server-client architecture and parameters described earlier.
First, we explored the impact of increasing the number of records while keeping the number of parties constant. Using the same corresponding data for each family as before, we simulated a total of 30, 300, ..., 3,000,000 records in a three-party scenario (i.e., each party had 10, 100, ..., 1,000,000 records).
Afterwards, we explored the impact of increasing the number of parties while keeping the number of records constant. Using the same setup as before, we simulated scenarios with 2, 3, ..., 10 parties with a total of 10,000 records. Records were distributed randomly and practically evenly among parties (e.g., in a three-party scenario, two parties would have 3333 records, while the remaining one would have 3334).
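A random and practically even split of this kind could be produced as in the following Python sketch. The helper names `party_sizes` and `split_records`, the fixed seed, and the choice to give the extra records to the last parties are assumptions made for illustration.

```python
import random


def party_sizes(total_records, n_parties):
    """Sizes that differ by at most one record between parties."""
    base, remainder = divmod(total_records, n_parties)
    # The last `remainder` parties receive one extra record each.
    return [base + (1 if i >= n_parties - remainder else 0)
            for i in range(n_parties)]


def split_records(records, n_parties, seed=0):
    """Shuffle the records, then cut them into near-equal partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for size in party_sizes(len(shuffled), n_parties):
        parts.append(shuffled[start:start + size])
        start += size
    return parts


sizes = party_sizes(10_000, 3)          # [3333, 3333, 3334]
parts = split_records(range(10_000), 3)
```

Every record ends up in exactly one partition, which matches the exclusive (horizontal) partitioning assumed by the algorithm.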
In both of these validations, in order to obtain a proper idea of the performance and to account for the variability due to inherent randomness in the data generation process, each case was simulated 100 times. We only measured the time (in seconds) from the beginning to the end of the federated GLM execution. In other words, we discarded the time it took to generate the data, since in a real-life situation, each party would already have their own data at hand.
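The measurement protocol described above can be sketched as a small Python timing harness. This is an assumption-laden illustration rather than the validation code itself: `time_runs`, `make_data`, and `run` are hypothetical names, and the data generation is deliberately kept outside the timed region.

```python
import statistics
import time


def time_runs(run, make_data, repetitions=100):
    """Time `run(data)` over several repetitions; return (mean, sd) in seconds."""
    durations = []
    for rep in range(repetitions):
        data = make_data(rep)              # data generation is not timed
        start = time.perf_counter()
        run(data)                          # only the model fit is timed
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Toy usage with a stand-in workload instead of an actual federated fit.
mean_s, sd_s = time_runs(run=lambda data: sorted(data),
                         make_data=lambda rep: list(range(1000, 0, -1)),
                         repetitions=5)
```

The mean and standard deviation returned here correspond to the data points and error bars reported for each scenario.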

Results and Discussion
Table 1 shows a comparison between the results of the centralized and the federated GLMs. We can see that in all cases, the output of both types of models is practically identical. This demonstrates that our implementation of a federated GLM is capable of generating a model equivalent to that obtained with a centralized approach but without the need to share any data among the involved parties. Having established this, we proceeded to validate its execution time.
Figure 2 shows the execution time as a function of the total number of records in a fixed three-party scenario. For all three families, the execution time remains relatively constant from 30 to 30,000 records: ∼0.5 s for the linear model and ∼5 s for both the logistic and Poisson families. Beyond that point, there is a large increase, with the linear family showing approximately a 10× increase (reaching ∼5 s) and the logistic and Poisson families showing approximately a 6× increase (reaching ∼30 s).
Figure 3 shows the execution time as a function of the number of parties while keeping the total number of records constant (at 10,000). We can see that for all three families, there is a steady increase in execution time as a function of the number of parties. This makes sense, since a larger number of parties implies a larger number of communications between them and the server for each iteration of the algorithm. The linear family remains the fastest model by far, followed by the logistic and Poisson families. We should mention that the execution times shown here are just indicative and can only provide an idea of how the algorithm scales with increasing numbers of records/parties. These times will very likely be different when used in a real-life scenario for a variety of reasons. First of all, this study was performed using a mock server-client architecture, which simulates a server and the nodes in a local environment. In this setup, the server-node communication is very efficient, since there is practically no overhead caused by network operations. In real life, said overhead is very likely to be larger due to low internet connection speeds, varying network infrastructures, etc., yielding a slower execution [31]. Moreover, the data used for the validation were generated artificially with relatively simple relations between variables. Real-life data can be much more complex, which can cause the algorithm to take a larger number of iterations to converge (with the possibility of not converging at all).
There are several ways that the presented federated GLM algorithm can be used in real-life analyses. If the parties have an FL infrastructure up and running, they could either implement it from scratch according to their particular needs (based on the description given in Section 2.2) or they could use our provided implementation. As mentioned earlier, this was completed in R. However, it can be used in either R or Python through the provided wrappers. If the parties do not have an FL infrastructure, the easiest way is to use it as part of vantage6. An exhaustive description of the platform is given by Moncada-Torres et al. [29] and Smits et al. [32], while its documentation can be found on its website https://vantage6.ai/ (accessed 14 June 2022).
Our federated GLM is already being used in real-life analyses. For example, it has been applied by Wenzel et al. as a logistic regression to identify women with early stage cervical cancer at low risk of lymph node metastasis. This subpopulation of women is very specific and the number of cases tends to be quite low, which makes it difficult to draw conclusions from a small patient cohort. In said study, the authors used our federated GLM to generate a logistic regression that uses data from three different parties (namely, cancer registries across three different European countries). This way, they increased the number of patients for their analysis and generated a more powerful model that supports identifying women with early stage cervical cancer with a low risk of lymph node metastasis, allowing for a more conservative treatment [33,34]. In another interesting use case, Hamersma used our federated GLM as a basis for performing stratified propensity score matching [35] (which in turn reduced confounding by indication) between breast cancer subpopulations of two international cancer registries. Afterwards, the authors compared quality indicators between them, all of it without having to pool data together [36]. These are only a couple of examples where our presented federated GLM has been used. Its applications can go well beyond oncology or health care.
There are a few aspects of our federated GLM that could be improved. For example, communication overhead is known to be a bottleneck for all FL applications [37]. Communication could be made more efficient by using parallelism in each training round [23] or by actively managing parties' contributions based on the status of their conditions (e.g., network speeds, time required for local updates) [38]. Model update time could also be decreased by having the parties transmit only part of the updated local model [39].

Conclusions
In this paper, we presented our federated GLM, which allows generating models of different families (linear, Poisson, logistic) in a privacy-preserving manner. The algorithm can be implemented in the platform of choice of the user or it can be utilized out-of-the-box using the provided implementation in our infrastructure for FL, vantage6. We validated its performance by comparing it with its centralized counterpart, which can be considered the gold standard. Given the mathematical equivalence of the two algorithms, our federated GLM reproduced outputs that are practically identical to those obtained when all the data were pooled together. We also validated its execution time as a function of the number of records and number of parties. Both validations demonstrated the usability of our federated GLM for analyzing horizontally partitioned data without disclosing information at a record level. Further development of this type of algorithm has the potential to make privacy-preserving analysis methods, such as FL, a much more common practice among researchers.

Figure 1. Server-client architecture required for executing the presented federated GLM algorithm. Notice that the data never leave their corresponding party.

Figure 2. Execution time for the different families (linear, Poisson, logistic) of the GLM. In this case, the number of parties was kept constant (at 3), while the total number of records went from 30 to 3,000,000 with increasing order of magnitude. Thus, the scale of the x-axis is logarithmic. Each scenario was simulated 100 times. Data points represent the mean of the execution time, while error bars represent ±1 standard deviation. The lines of each family were slightly shifted along the x-axis to avoid overlap between them.

Figure 3. Execution time for the different families (linear, Poisson, logistic) of the GLM. In this case, the number of records was kept constant (at 10,000), while the number of parties was 2, 3, ..., 10. Each scenario was simulated 100 times. Data points represent the mean of the execution time, while error bars represent ±1 standard deviation. The lines of each family were slightly shifted along the x-axis to avoid overlap between them.