Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification

In bioinformatics, genome-wide association studies (GWAS) are used to detect associations between single-nucleotide polymorphisms (SNPs) and phenotypic traits such as diseases. Significant differences in SNP counts between case and control groups can signal an association between variants and phenotypic traits. Most traits are affected by multiple genetic locations. To detect these subtle associations, bioinformaticians need access to more heterogeneous data. Regulatory restrictions on cross-border health data exchange have created a surge in research on privacy-preserving solutions, including secure computing techniques. However, in studies of such scale, one must account for population stratification, as under- and over-representation of sub-populations can lead to spurious associations. We improve on the state of the art in privacy-preserving GWAS methods by showing how to adapt principal component analysis (PCA) with stratification control (EIGENSTRAT), FastPCA, EMMAX and the genomic control algorithm for secure computing. We implement these methods using two secure computing techniques: secure multi-party computation (MPC) and trusted execution environments (TEE). To date, our algorithms are the most complex ones implemented with MPC at this scale. We present performance benchmarks and a discussion of the security and feasibility trade-offs of both techniques.


Introduction
Genome-wide association studies (GWAS) split a cohort of individuals into two or more groups based on a phenotypic trait of interest, e.g., the level of hemoglobin or the presence or absence of a disease. These groups are compared to each other in the framework of case-control studies to find single-nucleotide polymorphisms (SNPs) in the DNA sequence that are significantly overrepresented in one group. GWAS have been performed extensively to date, and various methodological concerns have been identified. At least two of these problems, the existence of polygenic phenotypes and population stratification, can be alleviated with the use of more heterogeneous databases with larger volumes of data.
Polygenic phenotypes are traits that are affected by multiple genetic locations. Only a small subset of these locations are known, and they explain only a small amount of variance. To study them further, bioinformaticians need access to more heterogeneous data in order to detect subtle associations.
Population stratification is most often caused by geographic isolation of subpopulations over several generations [1]. If the case and control groups are compiled carelessly, genetic differentiation caused by population stratification can confound associations between genotype and trait. The resulting false positive or false negative associations may, in fact, result from differences in local ancestry and be independent of the phenotypic trait under investigation [2]. An example of allele frequencies differing because of ancestry was shown in a case-control study of a European American cohort [3]: a SNP in the lactase gene (LCT) gave a strong, statistically significant association (p < 10^−6) with height. Both height and the LCT gene vary widely across populations in Europe. This spurious association was reduced when individuals were rematched on the basis of European ancestry.
While collaboration among biobanks is improving, the construction of datasets that represent continent-wide or worldwide populations poses significant privacy risks. To alleviate these risks, privacy-preserving methods with cryptographic guarantees have been identified as a potential solution. Moreover, the European Data Protection Board (EDPB) has acknowledged certain secure computing technologies (split or multi-party processing) as an effective technical measure for protecting personal data transfers across borders [4]. This comes in light of the recent judgment of the Court of Justice of the European Union in the Schrems II case. Furthermore, the European Data Protection Supervisor has noted the role of technologies that enable computation over encrypted data in the context of creating the European Health Data Space [5]. Secure multi-party computation (MPC) is one such privacy-enhancing technique that can provide cryptographic guarantees. Two organisational models for GWAS-on-MPC were shown already in [6]. A number of implementations have been proposed since, including [7][8][9][10][11][12][13]. Components of GWAS have also been implemented using fully homomorphic encryption [14,15] and Intel® Software Guard eXtensions (SGX) trusted execution environments [16][17][18][19].
However, even in big heterogeneous datasets, population stratification or over-representation of sub-populations can lead to spurious associations. To counteract these effects, bioinformaticians use estimation methods that derive parameters representing the ancestry of the cohort. The number of spuriously associated SNPs can vary greatly based on how different or similar the underlying populations are. Principal component analysis (PCA) uses eigenvector decomposition to infer population stratification. The results can be used to visualise genetic relationships between individuals alongside reference populations, helping identify population stratification. Alternatively, the results can be used to adjust the genotypes and phenotypes for stratification, as in [20]. Stand-alone PCA for GWAS using MPC has been demonstrated in [8,10,21]. A survey of privacy-preserving techniques for bioinformatics was published in 2017 [22].
In this paper, we show how state-of-the-art GWAS methods that account for population stratification can be made privacy-preserving. That is, we present privacy-preserving versions of the following four algorithms: EIGENSTRAT [20], FastPCA [23], EMMAX [24] and genomic control [25]. We redesign these widely used algorithms to use two different privacy-enhancing technologies (PETs): secure multi-party computation and trusted execution environments. The algorithms can be used on multiple protocol backends in various security models. We compare the solutions, implement the algorithms and present benchmark results. For the implementation, we use the SHAREMIND MPC [26] and SHAREMIND HI platforms. SHAREMIND HI uses Intel® Software Guard eXtensions (SGX) technology to create trusted execution environments and can thus be run securely in a single-server setting, while SHAREMIND MPC requires a more complex setup.

Genome-Wide Association Studies
To conduct genome-wide association studies (GWAS), the cohort of individuals is usually split into two (or more) groups-cases and controls-based on a phenotypic trait of interest, e.g., the colour of the eyes. Next, the genotype data of the groups are studied to detect DNA sequence single-nucleotide polymorphisms (SNPs) that are significantly overrepresented in one group. Although all four nucleotides (A, C, G, and T) can be located at a SNP site, usually, only two alleles A and B are considered. The first (A) corresponds to the reference sequence, and the second (B) represents potential mutations. SNP data are stored as pairs of chromosomes (AA, AB, BB, and NN if the measurement could not be completed for some reason) in text format. It is also common to use binary encoding for SNPs, where zero denotes the dominant nucleotide in the specific DNA location and one denotes the remaining three alternatives. As the same genetic information is represented in two chromosomes and common genotyping techniques measure these simultaneously, the resulting measurement outcome can be encoded as {⊥, 0, 1, 2}, where ⊥ encodes measurement failures.
In GWAS, genotype data are usually represented as an n × m matrix X with values {0, 1, 2}, where the rows correspond to individuals and the columns to SNP locations. Phenotype data are given as a {0, 1} vector y of length n, where 1 denotes the case group (the existence of the trait(s) in question) and 0 denotes the control group (the lack of the trait(s) in question). The measurement failures are denoted by the n × m mask matrix M with values {0, 1} that indicate whether the corresponding SNP is available or not (0 represents a missing SNP and 1 an available SNP). Table 1 gives an example of what this dataset looks like.
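As a concrete (non-private) illustration of this encoding, the following sketch builds a toy X, M and y; the variable names and toy data are ours:

```python
import numpy as np

# Toy cohort: 4 individuals (rows) and 3 SNP locations (columns).
# None marks a failed measurement.
calls = [["AA", "AB", "BB"],
         ["AB", "AB", None],
         ["BB", "AA", "AA"],
         [None, "BB", "AB"]]

code = {"AA": 0, "AB": 1, "BB": 2}   # number of B alleles
X = np.array([[code.get(c, 0) for c in row] for row in calls])
M = np.array([[0 if c is None else 1 for c in row] for row in calls])
y = np.array([1, 1, 0, 0])           # 1 = case group, 0 = control group

print(X)
print(M)
```

In the privacy-preserving setting, the same three structures are provided in protected form; the encoding itself is identical.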

Algorithms for GWAS That Account for Population Stratification
To assess the effect of SNP i on a phenotypic trait in an additive model, the allelic data of the marker can be presented in a table as described in Table 2. For a perfectly mixed population, the marker can be tested with the χ²-test. However, population substructure and inbreeding can lead to increased homozygosity, which induces extra-binomial variance, leading to spurious associations. Within a genotype, this increased variance can be accounted for with the Cochran-Armitage trend test, which is applied to each SNP i separately. However, the trend test does not take into account the increased allelic variance between genotypes caused by population substructures and hidden relatedness. We selected four algorithms that are used to account for population stratification and implemented them using secure computing techniques.
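For reference, the (non-private) Cochran-Armitage trend test for a single SNP can be sketched as follows, with genotype counts laid out as in Table 2 and the weights (0, 1, 2) encoding the additive model; the function name is ours:

```python
import numpy as np

def catt(r, s, w=(0, 1, 2)):
    """Cochran-Armitage trend statistic for one SNP.
    r, s: counts of (AA, AB, BB) among cases and controls."""
    r, s, w = map(np.asarray, (r, s, w))
    n = r + s                        # per-genotype totals n0, n1, n2
    R, S = r.sum(), s.sum()          # case and control group sizes
    N = R + S
    T = np.sum(w * (r * S - s * R))
    var = (R * S / N) * (np.sum(w**2 * n) - np.sum(w * n)**2 / N)
    return T**2 / var                # ~ chi-squared, 1 df, under H0

print(round(catt([10, 20, 30], [30, 20, 10]), 3))   # → 2400.0
```

The privacy-preserving versions below compute the same quantity, but on protected counts and for all SNPs in parallel.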

Genomic control. To correct for the effect of population stratification, this algorithm [25] estimates the variance inflation factor, which measures the increase in allelic variance across individuals within the sample. Genomic control assumes that this factor is constant for all markers, and it can therefore be estimated using the Cochran-Armitage test statistics of all markers. To lower the rate of false positives caused by cryptic relatedness and population stratification, the test statistics are divided by the computed estimate. The resulting statistics are approximately χ²-distributed with one degree of freedom under the null hypothesis.
While powerful, genomic control has its limitations. Mainly, if the markers are not uniformly differentiated across ancestral populations, the estimated inflation factor can be too small for some markers, leading to a number of false positive associations, while for other markers it can be too large, overcorrecting and reducing power.
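In the clear, the correction itself is only a few lines: the inflation factor is estimated as the median trend statistic divided by the median of the χ² distribution with one degree of freedom (≈0.456) and clipped below at 1. This sketch and its names are ours:

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.456   # median of the chi-squared distribution with 1 df

def genomic_control(stats):
    """Estimate the inflation factor and deflate the trend statistics."""
    lam = max(1.0, float(np.median(stats)) / CHI2_1DF_MEDIAN)
    return np.asarray(stats) / lam, lam

stats = np.array([0.2, 0.5, 0.9, 1.4, 12.0])   # toy per-marker statistics
adj, lam = genomic_control(stats)
print(round(lam, 3))   # → 1.974
```

Note that the median makes the estimate robust to the small number of truly associated markers; the privacy-preserving version described later obtains it via oblivious sorting.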
Principal component analysis. It has been shown that, for samples of different subpopulations, the first principal components of the genotype data describe the differences in ancestry [20]. Therefore, it is possible to correct for stratification by adjusting the genotype and phenotype data using the principal components as covariates, as is done in the EIGENSTRAT algorithm [20]. For hypothesis testing, a generalisation of the Cochran-Armitage trend test can be used on the adjusted genotypes and phenotypes. Unlike the genomic control algorithm, this approach does not adjust all markers uniformly. Therefore, it can better capture the differences in allele frequencies caused by different ancestral populations.
The main drawback of principal component analysis is that it is often unclear how many and which principal components capture the population substructure and what other top principal components capture in the sample structure. Usually up to 10 principal components are used to adjust for population stratification.
FastPCA [23] is a variation of principal component analysis. It uses recent advances in random matrix theory to reduce the computational effort in approximating the first principal components, which are simply the top eigenvectors of the kinship matrix. We chose FastPCA because computing eigenvectors in the privacy-preserving environment is very time consuming, as are all operations that deal with floating-point computations, and we do not need to compute exact eigenvectors.
EMMAX. An even more elaborate approach is to use linear mixed models to correct for population stratification, with each SNP as a fixed effect [24]. The relatedness of individuals in the sample is captured by the variance components. To assess the effect of a SNP on the phenotype, we first estimate these variance components, after which we can test the significance of the fixed effect by using a t-test.
The EMMAX algorithm is computationally more demanding than the previous algorithms. However, it is often statistically more powerful because it is able to better capture and correct for cryptic relatedness, especially for smaller sample substructures.

Deployment Models for Privacy-Preserving GWAS
In a typical collaborative genome-wide association study setting, two or more biobanks or service providers run collaborative analysis on data they have collected from their donors. Due to regulatory requirements or personal preferences, the processing of donor data should be minimised so that data leaks can be kept to a minimum. However, donors may still be willing to participate in research if a privacy-preserving solution is available.
Each biobank holds genotype data and medical diagnoses of a number of individuals. The analyst wants to run the genome-wide association study on the combined data of the biobanks. Since a person's genotype data are sensitive, the biobanks cannot reveal their data to each other. The analyst must not be able to determine any information about the individuals in the study by performing the analysis, and they should only receive the names of the significant SNPs. For a more detailed analysis of the usage models of MPC in GWAS, refer to [6].
Either the donors or the biobanks act as the input parties, who provide protected data for the computation. The biobanks, universities or dedicated service providers act as computing parties who host and run the secure computing system. The system is used by researchers in, e.g., universities or pharmaceutical companies who want to study a specific disease. They are the result parties in the system.

Cryptographic Secure Multi-Party Computation
In secure multi-party computation (MPC), two or more parties compute a function without seeing any of the private input values of the other parties. Most often secret sharing, garbled circuits or homomorphic encryption are used to enable MPC [27]. MPC technology has reached a level of maturity required to enable real-world implementations [28]. Data processing applications are typically implemented using one of the programmable MPC frameworks [29].
In this paper, we prototype our implementation using the SHAREMIND MPC platform [26]. SHAREMIND MPC is a distributed platform for privacy-preserving data processing supporting multiple MPC protocol sets, which use secret sharing as the secure storage method. Each value that needs to be protected is shared by the input parties according to the secret sharing algorithm. The resulting random shares are distributed between multiple computing parties.
Each computing party has a copy of the algorithms and runs them using MPC protocols that convert secret-shared inputs into secret-shared outputs, without recovering the private values on any computing party. Such MPC systems provide end-to-end security for data processing if a sufficient number of computing parties follow the protocol. If one or more do not, it may cause the computation to reach the wrong result or even breach privacy. The particular protocol set determines the exact failure mode. Efficient protocol sets are available for two and three computing parties, and adding more parties increases the communication complexity, a common bottleneck in MPC based on secret sharing.
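As an illustration of the secret-sharing principle (a toy additive scheme over the ring Z_{2^32}, not SHAREMIND's actual protocol suite; all names here are ours):

```python
import secrets

MOD = 2**32   # the ring Z_{2^32}

def share(x, parties=3):
    """Split x into additive shares that are individually uniform
    and jointly sum to x modulo 2^32."""
    shares = [secrets.randbelow(MOD) for _ in range(parties - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    return sum(shares) % MOD

a, b = share(1000), share(234)
# Addition is local: summing shares point-wise yields shares of the sum.
c = [(x + y) % MOD for x, y in zip(a, b)]
print(reconstruct(c))   # → 1234
```

Any proper subset of the shares is uniformly random and reveals nothing about the secret; multiplication, comparisons and floating-point operations are what require the inter-server communication discussed above.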

Trusted Execution Environments
A trusted execution environment (TEE) is a secure area of a processor that ensures the confidentiality and integrity of the data loaded into it. It is isolated from the rest of the processor, and its computations cannot be monitored from outside. In a standard deployment, data are first sent to the TEE. Next, either computations are run immediately, or the data can be sealed and stored outside the TEE (kept on disk in encrypted form). For computing, the data are loaded into the TEE and decrypted there. After the computations have been completed, the results are either encrypted and stored on disk or published. The host does not have access to the encryption keys for the data and cannot see the decrypted data in any other way.
Intel® Software Guard eXtensions (SGX) has emerged as a popular way to create trusted execution environments. SGX provides features such as enclaves, attestation and data sealing to protect data and to allow for remote audit of processing and privacy policy enforcement. In SGX, enclaves are protected memory areas that preserve the confidentiality and integrity of data even if there is malware in the system. Attestation helps the system prove to external parties that it is running the trusted software and the correct version of the application. Data sealing (encryption) is used to store data outside the enclave while protecting their confidentiality and integrity; only the specific enclave knows the encryption key and can decrypt the data.
Trusted execution environments do not require distributed storage; thus, they can be used to build a system with just a single computing party. However, that computing party has a different trust model than the computing parties in MPC. First, one has to trust the hardware vendor whose processors are used to create the trusted execution environments, and that the technology has been implemented correctly. Second, applications of TEEs need to account for side-channel attacks that observe execution time, power usage or the electromagnetic radiation of the hardware while the privacy-preserving task is running, as these can be used to recover the encrypted data in the enclave. At the same time, TEEs can perform privacy-preserving operations significantly faster than MPC, and thus the trade-off between the two security models requires study. See [30] for a recent review of SGX attacks and their mitigations.
SHAREMIND HI (HI standing for hardware isolation) is a privacy-preserving data processing platform that uses SGX enclaves, sealing, and attestation to support data analysis processes between identified participants and provides end-to-end privacy for the protected data for applications just as SHAREMIND MPC does.

Adapting Algorithms for Secure Computing
The privacy-preserving algorithms that we design in this paper minimise data leakage. Our design process follows the security goals for privacy-by-design statistical analysis algorithms as defined by [31]. (The authors of [31] also note the importance of query restrictions that prevent new, possibly insecure, algorithms from being executed on a secure computing platform before whitelisting. However, this does not affect the design of algorithms, only the choice of the secure computing platform the algorithms will run on.) The GWAS algorithms in this paper are designed to avoid leaking information based on the order of donors' data in the inputs. The order of SNPs in the inputs will affect the order of SNPs in the output, but this is by design, as the related metadata are not private data.
Output privacy. An algorithm is output-private if the results it declassifies do not leak the private inputs. The GWAS algorithms in this paper can be used as parts of larger processing workflows, in which case they do not declassify outputs. However, the results of the algorithms are per-SNP statistical significances and contain no personal data given a sufficient number of inputs.
The algorithms also need adaptation for performance. In secure multi-party computation based on secret sharing, servers need to communicate with each other to run the shared computations. It has been shown that redesigning algorithms with maximum parallelisation helps use the communication channel more efficiently, reducing the round count and improving speed. Thus, the algorithms in this paper are adapted to use single-instruction, multiple-data (SIMD) vector and matrix operations to the maximum reasonable extent.
The algorithms are designed for secure computing frameworks supporting integer, fixed and floating-point arithmetic. In this paper, we implement the algorithms using the SHAREMIND MPC and SHAREMIND HI platforms as they support a comparable application model on different secure computing technologies. This allows us to compare the techniques. However, the algorithms would work on any secure computing technique, including fully homomorphic encryption or zero knowledge.
In the following algorithm representations, we use the notation ⟦x⟧ to denote a value x that is protected by the selected secure computing system. For example, in an MPC runtime based on secret sharing, ⟦x⟧ means that x has been secret-shared among multiple parties so that no party can recover it. For trusted execution environments, ⟦x⟧ means that x is only processed within the enclave. Similarly, ⟦x⟧ and ⟦X⟧ denote protected data structures (vector x and matrix X, respectively). For a matrix X and indexes i, j, k, l, we denote by X[i : j, k : l] the submatrix of X consisting of elements from rows i, . . . , j − 1 and columns k, . . . , l − 1. A missing index means that we take all the elements up until or starting from that index.
The operator * denotes element-wise multiplication between two matrices or multiplication by a scalar and is used to distinguish this operation from matrix multiplication. The operator / denotes element-wise division by a scalar, vector or matrix, as necessary. The functions rowSums and colSums compute the sum of each row or column of a matrix, respectively. The output is a vector.
The function reshape takes as input a vector and two integers n and m and reshapes the vector row-wise into an n × m matrix. The function flatten flattens a matrix row-wise into a vector. The function choose takes as inputs a Boolean vector m and two vectors x and y of the same type and chooses elements point-wise from either x or y based on the values in vector m. The operator cut takes as arguments a Boolean vector m and a vector x of the same size and of any type and discards the elements of x point-wise if the corresponding value in m is false. The size of the output vector can leak information, since it is equal to the number of true values in m; in our implementations, we make sure that the function always outputs a vector with a single value, which guarantees that no information is leaked.
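In the clear, these helpers behave like the following NumPy one-liners (our plain-text stand-ins; the secure versions perform the same operations on protected values):

```python
import numpy as np

def reshape(v, n, m):   # row-wise vector -> n x m matrix
    return np.asarray(v).reshape(n, m)

def flatten(X):         # row-wise matrix -> vector
    return np.asarray(X).reshape(-1)

def choose(m, x, y):    # point-wise: x[i] where m[i] is true, else y[i]
    return np.where(np.asarray(m, dtype=bool), x, y)

def cut(m, x):          # keep x[i] only where m[i] is true
    return np.asarray(x)[np.asarray(m, dtype=bool)]

print(choose([True, False, True], [1, 2, 3], [9, 9, 9]))   # → [1 9 3]
print(cut([False, True, True], [4, 5, 6]))                 # → [5 6]
```

The secure choose is an oblivious multiplexer; the secure cut is the only one of these whose output shape depends on protected data, which is why its use is restricted as described above.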

Results
In this section, we describe the privacy-preserving versions of the algorithms and give the benchmark results. We first describe and discuss EIGENSTRAT and FastPCA, then EMMAX, and finally the genomic control algorithm. The EIGENSTRAT and EMMAX algorithms use the privacy-preserving GS-PCA algorithm from [8]. GS-PCA is highly parallelisable and therefore especially well suited to the secure multi-party setting. Our privacy-preserving genomic control algorithm uses the privacy-preserving version of quicksort from [32].
The nature of the algorithm adaptation was twofold: firstly, we made the modifications that were needed for the algorithm to work in the privacy-preserving setting, and secondly, we added implementation details for SHAREMIND, the MPC framework we used. To lower the computing time while maintaining accuracy, our MPC implementations use a mix of floating-point and fixed-point arithmetic. The conversion was only done in cases where the accuracy loss was at acceptable levels.
It is important to note that the algorithms can be viewed separately from their implementations, as the algorithms can be implemented for any MPC platform that has the necessary fixed-point and floating-point arithmetic available. The fixed-point and floating-point conversions that are discussed in these subsections can be beneficial for implementations on other platforms as well, and, therefore, we have added these directly into the algorithm description.

Privacy-Preserving EIGENSTRAT and FastPCA
Algorithm 1 describes the privacy-preserving version of EIGENSTRAT [20]. As input it receives the genotype matrix X , the corresponding matrix of mask vectors M , and the phenotype vector y . It also receives the number of components k and the number of GS-PCA iterations J. According to [20], k is usually 10 but can also be 1, 2 or 5. In our benchmark tests, we set k = 2. The GS-PCA algorithm requires the number of iterations J as an input. Using a public predetermined number of iterations avoids side channel attacks based on the number of iterations. In our benchmarks, we set J to 15 as this gives us sufficient precision.
We start by computing the vector µ of the means of the columns and use it to normalise the columns. In the MPC implementation, we use floating-point arithmetic to compute the normalised genotype matrix Z, and we convert this matrix to fixed-point values before finding the scaled kinship matrix K ← ZZ^T, which is n − 1 times the covariance matrix. We do not perform the division by n − 1, as this is an expensive operation and it does not influence the properties of the eigenvalues that we need for our computations.
Using the privacy-preserving GS-PCA algorithm, we find the vector λ of the k largest eigenvalues and the n × k matrix U of corresponding eigenvectors of matrix K . In this algorithm, we do not need the vector λ of eigenvalues, so we simply drop this result. Our MPC implementation of the GS-PCA algorithm also uses fixed-point arithmetic. To adjust for stratification, the entries are converted back to floating-point values as this keeps the accuracy loss to a minimum. Using the eigenvectors from U , we iteratively adjust the genotype matrix and phenotype vector and then compute the trend statistics using the adjusted values. No values are declassified during the execution.
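Outside the secure runtime, the normalisation, kinship and adjustment steps reduce to a few lines of linear algebra. The following NumPy sketch is our own simplification (it centres columns but omits the variance scaling and mask handling of the full algorithm, and uses an exact eigendecomposition instead of GS-PCA) and only illustrates the arithmetic that the privacy-preserving version performs on protected values:

```python
import numpy as np

def eigenstrat_adjust(X, y, k=2):
    """Centre the genotype matrix, find the top-k eigenvectors of the
    kinship matrix, and regress them out of genotypes and phenotypes."""
    Z = X - X.mean(axis=0)                     # column-centred genotypes
    K = Z @ Z.T                                # scaled kinship matrix (n x n)
    vals, vecs = np.linalg.eigh(K)
    U = vecs[:, np.argsort(vals)[::-1][:k]]    # top-k eigenvectors
    Zadj = Z - U @ (U.T @ Z)                   # remove ancestry components
    yc = y - y.mean()
    yadj = yc - U @ (U.T @ yc)                 # adjust the phenotype too
    return Zadj, yadj

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 8)).astype(float)   # toy {0,1,2} genotypes
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)       # case/control labels
Zadj, yadj = eigenstrat_adjust(X, y)
```

After the adjustment, both Zadj and yadj are orthogonal to the top principal components, so the subsequent trend statistics are computed on ancestry-corrected data.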
The biggest issue with this algorithm in the privacy-preserving setting is that finding ZZ T requires a lot of computational effort. The computation time of the GS-PCA algorithm also increases quadratically as the number of rows in Z (the size of the sample) increases. To deal with this, we can instead use FastPCA [23], which makes use of random matrix theory.
The privacy-preserving implementation is fairly straightforward. We use the Marsaglia polar method to generate the entries of the random matrix G_i. To speed up the process of finding U_k, we used fixed-point rather than floating-point values. The biggest difference between fixed- and floating-point computation comes from computing G_i (ZZ^T)^J Z. While this process is faster when done with fixed-point values, it causes accuracy loss in some cases.
It is important to keep in mind that, when using fixed-point values, we can get overflows during the iterative matrix multiplication. To prevent this, we divide the product matrix G_i (ZZ^T)^i by a suitable constant after each iterative step.
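The mechanism can be illustrated with a toy fixed-point type (illustrative parameters, not SHAREMIND's fixed-point representation). Each product is divided back down by a constant (here the scale factor 2^FRAC, implemented as a shift) after every step, keeping the iterated values within range:

```python
import numpy as np

FRAC = 16  # fractional bits of the toy fixed-point type

def to_fix(x):
    return np.round(np.asarray(x) * 2**FRAC).astype(np.int64)

def to_float(f):
    return np.asarray(f, dtype=np.float64) / 2**FRAC

def fix_matmul(A, B):
    """Fixed-point matrix product: multiply in the ring, then shift the
    result back down so the scale (and magnitude) stays bounded."""
    return (A @ B) >> FRAC

A = to_fix([[1.5, 2.0], [0.5, 1.0]])
P = A
for _ in range(3):        # iterated products, rescaled after every step
    P = fix_matmul(P, A)
print(np.round(to_float(P), 2))
```

Without the shift, the scale factor would double with every multiplication and the entries would quickly exceed the word size; the price is a small truncation error per step, which mirrors the accuracy-loss discussion above.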
For QR factorisation and eigendecomposition, we used the Householder QR algorithm and the symmetric QR algorithm [33], respectively. Having found the k top principal components, we can use them to adjust our genotype and phenotype vectors like in EIGENSTRAT. Again, no values are declassified during the execution.
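The randomised pipeline described above (a random probe matrix, repeated multiplication by ZZ^T with re-orthonormalisation, and a small eigendecomposition) can be sketched in the clear as a generic randomised subspace iteration; this illustrates the technique, not the exact FastPCA procedure of [23], and all names are ours:

```python
import numpy as np

def top_components(Z, k, J=10, oversample=10, seed=0):
    """Approximate the top-k eigenvectors of the kinship matrix Z @ Z.T
    by randomised subspace iteration, without ever forming Z @ Z.T."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((Z.shape[0], k + oversample))  # random probes
    for _ in range(J):
        Q, _ = np.linalg.qr(Z @ (Z.T @ Q))   # power step + re-orthonormalise
    B = Q.T @ (Z @ (Z.T @ Q))                # small projected matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:k]
    return (Q @ vecs)[:, top]

# Toy matrix whose kinship eigenvalues (100, 25, 1, 0.25) are well separated.
Z = np.zeros((4, 6))
Z[0, 0], Z[1, 1], Z[2, 2], Z[3, 3] = 10.0, 5.0, 1.0, 0.5
U = top_components(Z, k=2)   # columns approximate the top-2 eigenvectors
```

The expensive eigendecomposition is performed only on the small projected matrix B, which is why this approach pays off in the privacy-preserving setting, where floating-point operations are costly.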

Privacy-Preserving EMMAX
Algorithm 3 describes the privacy-preserving version of EMMAX [24]. Although the algorithm can generally be used for the analysis of quantitative phenotypic traits, we implemented it with case-control studies in mind. Therefore, as input, the algorithm takes a similar genotype matrix X as EIGENSTRAT, the mask matrix M indicating the missing values, the phenotype vector y indicating the inclusion into the case group, and the number of GS-PCA iterations J.
We use a mix of fixed-point and floating-point arithmetic in our MPC implementation of the EMMAX algorithm. We start by normalising the columns of X. As with the privacy-preserving EIGENSTRAT, we use floating-point arithmetic to compute the normalised genotype matrix Z and convert the entries to fixed-point values to compute the covariance matrix K ← ZZ^T. We then use the GS-PCA algorithm to find the eigendecomposition of K. The resulting eigenvectors and eigenvalues are converted back to floating-point values. The rest of the computations in the MPC implementation use floating-point arithmetic, which keeps the accuracy loss to a minimum.
We denote by ξ the vector of eigenvalues and by U_F the matrix of corresponding eigenvectors of K, and by λ the vector of the n − 1 largest eigenvalues and by U_R the matrix of corresponding eigenvectors. After finding the vector η = U_R^T y, we create a 101 × (n − 1) matrix δMat, where every column has values ranging from 10^−5 to 10^5. Creating 101 × (n − 1) matrices η2Mat and λMat, in which every row equals the vector η^2 and λ, respectively, allows us to compute the restricted log-likelihood function

ℓ_R(δ) = 1/2 [(n − 1) log((n − 1)/(2π)) − (n − 1) − (n − 1) log Σ_{s=1}^{n−1} η_s^2/(λ_s + δ) − Σ_{s=1}^{n−1} log(λ_s + δ)]

for all 101 candidate values of δ in parallel and to select the REML estimate of δ that maximises it. In order to estimate the linear coefficient β_k of SNP k for every k = 1, . . . , m, we need to compute (X_k H X_k^T)^{−1} X_k H y for each k, where

H = (K + δI)^{−1} = U_F diag(1/(ξ_1 + δ), . . . , 1/(ξ_n + δ)) U_F^T

is an n × n matrix (evaluated at the estimated δ) and X_k is the 2 × n matrix whose first row is the all-ones vector and whose second row is the kth column of the genotype matrix. Instead of computing each of these values one at a time, we define the (m + 1) × n matrix XH ← X̃H, where X̃ is the transposed genotype matrix with a prepended row of ones. Note that the first and (k + 1)th rows of the matrix XH are equal to the rows of the 2 × n matrix X_k H. We continue by computing the entries of the m × 4 matrix XHX, whose kth row contains the four entries of the 2 × 2 matrix X_k H X_k^T. As finding the inverse of a 2 × 2 matrix is easy, we can now create an m × 4 matrix iXHX, which contains the entries of (X_k H X_k^T)^{−1} in the kth row. Let yMat be an (m + 1) × n matrix for which each row is equal to the vector y. Next, we find the vector XHy ← rowSums(XH * yMat), whose first and (k + 1)th entries are equal to the entries of the vector X_k H y. The estimate of β_k can now be computed from the kth rows of iXHX and XHy, and we compute all these estimates in parallel. Finally, we compute the vector of t-statistics for the estimates of β_k and use them to find the significances s_k for all k = 1, . . . , m.
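In the clear, the grid search over δ can be sketched with NumPy. The restricted log-likelihood below follows our reading of the EMMA criterion [24] with an intercept-only null model; the variable names and toy inputs are ours:

```python
import numpy as np

def reml_grid_delta(lam, eta, n):
    """Evaluate the restricted log-likelihood on a 101-point grid of delta
    values (10^-5 .. 10^5) at once and return the maximising delta."""
    deltas = np.logspace(-5, 5, 101)
    lamd = lam[None, :] + deltas[:, None]      # (101, n-1): lambda_s + delta
    ll = 0.5 * ((n - 1) * (np.log((n - 1) / (2 * np.pi)) - 1)
                - (n - 1) * np.log((eta[None, :] ** 2 / lamd).sum(axis=1))
                - np.log(lamd).sum(axis=1))
    return deltas[np.argmax(ll)]

rng = np.random.default_rng(0)
n = 20
lam = np.sort(rng.uniform(0.5, 5.0, n - 1))[::-1]  # toy n-1 largest eigenvalues
eta = rng.standard_normal(n - 1)                   # toy eta = U_R^T y
delta_hat = reml_grid_delta(lam, eta, n)
```

Evaluating all 101 candidates as one broadcast computation mirrors the 101 × (n − 1) matrix layout used in the privacy-preserving version, where such SIMD evaluation is far cheaper than a data-dependent search.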

Genomic Control
Algorithm 4 describes the privacy-preserving version of the genomic control algorithm [25]. As input it receives the genotype matrix X and the phenotype vector y . For this privacy-preserving algorithm, we need to manipulate the shape of the data to achieve better performance. As comparison is rather expensive and we would have to perform a significant number of comparisons to count the number of different combinations {AA, AB, BB}, we will convert the data into the following format: the columns of the matrix X indicate individuals and for each SNP we have three rows (for AA, AB and BB), where the value is 1 for the corresponding observation and the values for the other two rows are 0. For the missing SNP values, all three rows are marked as 0.
We would like to compute the entries in Table 2 for each SNP in parallel as parallel operations are optimal in the privacy-preserving setting. For this, we will find the sum of the columns corresponding to the case and control group separately. This gives us six vectors r i , s i , i ∈ {0, 1, 2}. From these we can compute R = r 0 + r 1 + r 2 , S = s 0 + s 1 + s 2 , N = R + S , n 1 = r 1 + s 1 , n 2 = r 2 + s 2 .
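In the clear, this counting step amounts to two column sums over the one-hot block; a single-SNP sketch with our own toy data:

```python
import numpy as np

# One-hot genotype block: 3 rows (AA, AB, BB) per SNP, columns = individuals.
# Here: one SNP, 6 individuals; the 4th column is all-zero (missing value).
G = np.array([[1, 0, 0, 0, 1, 0],   # AA
              [0, 1, 0, 0, 0, 1],   # AB
              [0, 0, 1, 0, 0, 0]])  # BB
y = np.array([1, 1, 1, 0, 0, 0])    # case/control indicator

r = G[:, y == 1].sum(axis=1)        # case counts (r0, r1, r2)
s = G[:, y == 0].sum(axis=1)        # control counts (s0, s1, s2)
R, S = r.sum(), s.sum()
N = R + S
n1, n2 = r[1] + s[1], r[2] + s[2]
print(r.tolist(), s.tolist(), N)    # → [1, 1, 1] [1, 1, 0] 5
```

Because the one-hot rows replace equality comparisons with additions, the secure version only needs cheap summation protocols, and all SNPs can be processed in parallel by stacking their three-row blocks.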

Performance Measurements
All of the algorithms running on SHAREMIND MPC are implemented in the SecreC programming language [34,35]. The trusted execution environment versions are implemented in C/C++ and require the SHAREMIND HI runtime. The source code for each is available for download as additional material. The data that we used to run and measure our implementations were taken from the NCBI Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44974, accessed on 19 August 2021) and are publicly available for download. However, the contents of the dataset do not affect the performance results, as the running times of the privacy-preserving algorithms are data-independent by design; otherwise, they would leak information about the dataset.
As discussed in Section 2.3, secure multi-party computation introduces a level of complexity to already complex algorithms. In addition to designing privacy-preserving versions of frequently used GWAS algorithms, we also show that they can be implemented and run on data volumes realistic for genome-wide association studies. We implemented our algorithms using the privacy-preserving computation platform SHAREMIND. For MPC, we used the three-party protocol suite in the passive adversary model [36]. For HI, we used Intel Software Guard Extensions (SGX).
First we present the performance results of the MPC solution, then the TEE solution, and finally we compare the two approaches. The performance of the MPC algorithms was benchmarked using three servers with Intel Xeon E5-2640 processors, 128 GB of memory and dedicated 10 Gb/s connections. In the benchmark tables, we show running times for the different subtasks of the algorithms, giving a better overview of where optimisation can be attempted.

Table 3 presents benchmark results for our privacy-preserving EIGENSTRAT algorithm. Table preparation includes the time needed for reading the table and normalising it. Stratification control includes the time needed for finding the kinship matrix and subtracting the genotype and phenotype principal components. We looked at the running times for 1500, 2000, 5000 and 20,000 SNPs for 217 donors, and then also increased the number of donors to look at 2000 SNPs for 434 and 868 donors. The size of the case and control groups does not influence the running times.

For our privacy-preserving EIGENSTRAT, the computation of the kinship matrix proved to be the biggest computational overhead, scaling linearly with the number of SNPs and quadratically with the sample size. The GS-PCA algorithm, used for finding the top eigenvectors of the kinship matrix, also scales quadratically with the sample size, because the kinship matrix is an n × n matrix, where n is the number of people in our sample. Consequently, the running time of the GS-PCA algorithm does not depend on the number of SNPs. The other computational steps scale linearly with the dimensions of the input matrix.

Table 4 presents benchmark results for our privacy-preserving FastPCA randomised algorithm. Table preparation includes the time needed for reading the table and normalising it. Stratification control includes the time needed for subtracting the genotype and phenotype principal components.
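The kinship computation that dominates our privacy-preserving EIGENSTRAT can be sketched in plaintext as follows (a reference sketch with our own function names, not the SecreC implementation; monomorphic SNPs, for which the scaling factor would be zero, are assumed absent):

```python
import numpy as np

def normalise(G):
    """Mean-centre each SNP row of the {0,1,2} genotype matrix G and scale
    it by the binomial standard deviation estimate sqrt(p(1-p)), in the
    spirit of EIGENSTRAT."""
    mu = G.mean(axis=1, keepdims=True)  # per-SNP mean genotype
    p = mu / 2.0                        # estimated allele frequency
    return (G - mu) / np.sqrt(p * (1.0 - p))

def kinship(X):
    """Kinship matrix of the normalised (m x n) matrix X: an n x n matrix,
    costing O(m * n^2) multiplications, i.e., linear in the number of SNPs
    and quadratic in the sample size, matching the scaling we observe."""
    return X.T @ X / X.shape[0]
```

The n × n shape of the result is also why the subsequent GS-PCA step depends only on the sample size and not on the number of SNPs.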
Similarly to EIGENSTRAT, we looked at the running times for 1500, 2000, 5000, and 20,000 SNPs for 217 donors, and then also increased the number of donors to look at 2000 SNPs for 434 and 868 donors. The size of the case and control groups does not influence running times.
The main benefit of FastPCA is that it does not require computing the kinship matrix and, therefore, scales linearly with respect to both the number of SNPs and the sample size. In addition, eigendecomposition is only applied to a small (k + p) × (k + p) matrix, which offers clear speed-ups over the GS-PCA approach.

Table 5 presents benchmark results for our privacy-preserving EMMAX algorithm. Table preparation includes the time needed for reading the table and normalising it. The maximum likelihood times include everything in Algorithm 3 after the call to GS-PCA and before computing the test statistics; most of this time is spent on computing XHX. As the privacy-preserving EMMAX algorithm is significantly slower than the privacy-preserving PCA algorithms, we looked at the running times for 1000, 5000 and 20,000 SNPs for 217 donors, and then also looked at 1000 SNPs for 100 and 434 donors. As can be seen from Table 5, the running time for 1000 SNPs and 434 donors already exceeds 36 h, so we did not test the algorithm any further. The size of the case and control groups does not influence the running times.

Similarly to EIGENSTRAT, our EMMAX implementation computes the kinship matrix, which scales linearly with the number of SNPs and quadratically with the sample size. We then use the GS-PCA algorithm to compute the eigenvectors and eigenvalues of the kinship matrix. However, EMMAX needs the entire eigendecomposition instead of only a few eigenvectors, so this step takes significantly longer than in the privacy-preserving PCA. As the genotype matrix grows, the computation of the XHX matrix quickly overshadows the computational effort of the rest of the algorithm.

Table 6 presents benchmark results for our privacy-preserving genomic control algorithm.
We performed these computations for the usual 5000 SNPs and 217 donors, but then went on to test with far larger datasets than in the PCA experiments. Here, most of the running time is spent on table preparation, which includes reading the table and computing the genotype distribution table for the Cochran-Armitage tests. Note that due to the way data are managed in this algorithm, 300,000 SNPs take up 900,000 rows in the dataset.
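As a plaintext sketch of the statistics computed here (the additive-weight trend test on the counts of Table 2, followed by the genomic control correction; the formulas are the standard published ones, while the code and names are ours):

```python
import numpy as np

def trend_chi2(r0, r1, r2, s0, s1, s2):
    """Cochran-Armitage trend test with additive weights (0, 1, 2), computed
    for all SNPs at once from the per-genotype case counts r_i and control
    counts s_i (the six vectors obtained from the column sums described
    earlier)."""
    R, S = r0 + r1 + r2, s0 + s1 + s2
    N = R + S
    n1, n2 = r1 + s1, r2 + s2
    num = N * (N * (r1 + 2 * r2) - R * (n1 + 2 * n2)) ** 2
    den = R * S * (N * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    return num / den

def genomic_control(chi2):
    """Inflation factor lambda_GC = median(chi2) / 0.456 (0.456 being,
    rounded, the median of the chi-squared distribution with 1 degree of
    freedom), clipped below at 1, and the corrected statistics."""
    lam = max(np.median(chi2) / 0.456, 1.0)
    return lam, chi2 / lam
```

Both functions are purely elementwise over the SNP axis, which fits the parallel, data-independent execution model required in the privacy-preserving setting.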
Benchmark details for the HI versions of PCA and EMMAX are given in Table 7. Unlike the MPC benchmarks, these results are given in milliseconds. The SHAREMIND HI tests were run on a PC with an i7-7700 CPU and 16 GB of RAM.

With the genomic control algorithm for SHAREMIND HI, the running time of 15,112 milliseconds for 300,000 SNPs and 200 donors was mostly spent on reading the input data; only 1 ms of the total was spent on the other computations. For comparability, we performed the experiments on the same data structures as for MPC, which, as discussed, involves three rows for each SNP. As reading data is an expensive operation in HI but comparing strings is not, the HI version could be further optimised for real-world data encodings. Since the computations after data loading took only one millisecond altogether, we did not run these tests for fewer SNPs.

Table 8 presents the comparison of benchmark results for the SHAREMIND MPC and HI implementations of privacy-preserving GWAS algorithms with stratification control. As expected, the algorithms ran nearly instantaneously on SHAREMIND HI. The table also gives the running times under the artificial assumption that HI were 100 times slower, an estimate for the case where additional special measures are taken against side-channel attacks. For genomic control, this would still amount to around 20 s, dominated by reading the input data, because the computation times themselves would still not exceed seconds.

Discussion
Feasibility. We showed how to adapt complex stratification control algorithms for genome-wide association studies to secure computing. In particular, we demonstrate, on large datasets, the most complex PCA algorithms implemented with MPC in the literature. In addition, we adapted the algorithms to trusted execution environments, showcasing how the side-channel security of the proposed algorithms can benefit both MPC and TEEs.
Security. Both secure multi-party computation and trusted execution environments can offer strong confidentiality guarantees; however, both solutions have their advantages and drawbacks. MPC requires the service providers to set up at least two servers, more complex algorithms might not be implementable or might not finish execution in this environment, and the computations that do finish carry a major computation and communication overhead. On the other hand, when the input party entrusts their data to the computing parties, they do not need to trust any single party; it suffices to trust that the parties do not collude. Moreover, to gain additional confidence, an input party can itself become a computing party.
The trusted execution environment does not require such a complicated setup or collaboration between different parties: there is only one server, and everyone can import their data in encrypted form. In addition, the environment does not introduce a significant computational or communication overhead. However, the input parties must trust the trusted execution environment provider (Intel). If something in the enclave fails, e.g., if a critical vulnerability is discovered, all data could be compromised. Designers of TEE-based data analysis must therefore create a data lifecycle that reduces these risks.
Performance. It is clear that in the MPC setting, the running times of EMMAX are not feasible for real-world use on larger data volumes: it is a complex iterative algorithm that involves many floating-point matrix multiplications, and we fear that further optimisations would not yield a significant speed-up. However, genomic control shows that it is feasible to carry out privacy-preserving GWAS with stratification control using MPC, and the privacy-preserving EIGENSTRAT remains on the border of feasibility. For all four algorithms, we show that the computations can be run to completion.
For the solution based on the Intel Software Guard Extensions (SGX), it is clear that the computations are efficient and run in a time similar to the algorithms that do not use secure computing. Moreover, even if side-channel attack mitigations made the computation 100 times slower, it would still outperform MPC according to our experiments. Thus, further comparison of the security properties of real-world deployments of these technologies is needed.

Funding: This work has been supported by the EU H2020-SU-ICT-03-2018 Project No. 830929 CyberSec4Europe (http://cybersec4europe.eu, accessed on 19 August 2021). This research has also been supported by the European Regional Development Fund through the Estonian Centre of Excellence in ICT Research (EXCITE), and by the Estonian Research Council through grants PRG49 and PRG920.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data that we used to run and measure our implementation were taken from the NCBI Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44974, accessed on 19 August 2021), where they are available for download. However, the contents of the dataset do not affect the performance results, as the running times of the privacy-preserving algorithms are data-independent by design. Therefore, any other dataset (real or synthetic) can be used to reproduce the benchmark results in this paper.