# Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

^{®}Software Guard eXtensions (SGX) trusted execution environments [16,17,18,19].

^{®}Software Guard eXtensions (SGX) technology to create trusted execution environments and, thus, can be run securely in single server setting while Sharemind MPC requires a more complex setup.

## 2. Materials and Methods

#### 2.1. Genome-Wide Association Studies

#### 2.2. Algorithms for GWAS That Account for Population Stratification

**Genomic control.**To correct for the effect of population stratification, this algorithm [25] estimates the variance inflation factor, which measures the increase in allelic variance across individuals within the samples. Genomic control assumes that this factor is a constant for all markers, and it can therefore be estimated using the Cochran–Armitage test statistics of all markers. To lower the rate of false positives caused by cryptic relatedness and population stratification, the test statistics are divided by the computed estimate. The resulting statistics are approximately distributed ${\chi}^{2}$ with one degree of freedom under the null hypothesis.

**Principal component analysis.**It has been shown that for samples of different subpopulations, the first principal components of the genotype data describe the differences in ancestry [20]. Therefore, it is possible to correct for stratification by adjusting the genotype and phenotype data using the principal components as covariates, as is done in the EIGENSTRAT algorithm [20]. For hypothesis testing, a generalisation of the Cochran–Armitage trend test can be used on the adjusted genotypes and phenotypes. Unlike the genomic control algorithm, this approach does not adjust all markers uniformly. Therefore, it can better capture the differences in allele frequencies caused by different ancestrial populations.

**EMMAX.**An even more elaborate way is to use linear mixed models to correct for population stratification with each SNP as a fixed effect [24]. The relatedness of individuals in the sample are captured by the variance components. To assess the effect of the SNP to the phenotype, we first estimate these variance components, after which we can test the significance of the fixed effect by using a t-test.

#### 2.3. Privacy-Preserving Computation and GWAS

#### 2.3.1. Deployment Models for Privacy-Preserving GWAS

#### 2.3.2. Cryptographic Secure Multi-Party Computation

#### 2.3.3. Trusted Execution Environments

^{®}Software Guard eXtensions (SGX) has emerged as a popular way to create trusted execution environments. SGX provides features such as enclaves, attestation and data sealing to protect data and allow for remote audit of processing and privacy policy enforcement. In SGX, enclaves are protected memory areas that protect the confidentiality and integrity of data even if there is malware in the system. Attestation helps the system prove to external parties that it is running the trusted software and the correct version of the application. Data sealing (encryption) is used to store data outside the enclave while protecting its confidentiality and integrity. Only the specific enclave knows the encryption key and can decrypt the data.

#### 2.3.4. Adapting Algorithms for Secure Computing

**Cryptographic security.**During computation, no computing party learns any private input or intermediate value computed by the function unless the value is explicitly published. Private values are also not leaked through changes in the running time of the algorithm. We achieve this by designing algorithms that process all private values either by using MPC and TEEs or by making the algorithm running time as independent of the inputs as possible. This means designing the algorithms as straight-line programs that have no branching based on private values or where each branching decision is analysed with regard to what it could leak in the running time.**Source privacy.**An algorithm is source-private if all its outputs and all intermediate values do not depend on the order of inputs. The GWAS algorithms in this paper are designed to avoid leaking information based on the order of donors’ data in the inputs. The order of SNPs in the inputs will affect the SNPs in the output, but this is by design as the related metadata are not private data.**Output privacy.**An algorithm is output-private if the results it declassifies do not leak the private inputs. The GWAS algorithms in this paper can be used either as parts of larger processing workflows, and in this case they do not declassify outputs. However, the results of the algorithms are statistical significances per SNP and contain no personal data given a sufficient amount of inputs.

## 3. Results

#### 3.1. Privacy-Preserving EIGENSTRAT and FastPCA

Algorithm 1: Privacy-preserving EIGENSTRAT |

Algorithm 2: Privacy-preserving FastPCA randomised algorithm for top k left singular vectors |

#### 3.2. Privacy-Preserving EMMAX

Algorithm 3: Privacy-preserving EMMAX |

Input: Genotype matrix $\u27e6\mathbf{X}\u27e7$ (private), mask matrix $\u27e6\mathbf{M}\u27e7$ (private), phenotype vector $\u27e6\mathbf{y}\u27e7$ (private), number of GS-PCA iterations J (public) |

Output: Vector $\mathbf{t}$ of statistics for the t-test$\mathbf{1}\u27e6\mathbf{\mu}\u27e7\leftarrow \mathbf{colSums}(\u27e6\mathbf{X}\u27e7)/\mathbf{colSums}(\u27e6\mathbf{M}\u27e7)$; $\mathbf{2}\u27e6\mathbf{Z}\u27e7\leftarrow (\u27e6\mathbf{X}\u27e7-\u27e6\mathbf{M}\u27e7\ast \u27e6\mathbf{\mu}\u27e7)/\left(\mathbf{sqrt}\right(\u27e6\mathbf{\mu}\u27e7/2(1-\u27e6\mathbf{\mu}\u27e7/2)\left)\right)$; $\mathbf{3}\u27e6\mathbf{K}\u27e7\leftarrow \u27e6\mathbf{Z}\u27e7\u27e6\mathbf{Z}{\u27e7}^{T}$; $\mathbf{4}\u27e6\mathbf{\xi}\u27e7,\u27e6{\mathbf{U}}_{\mathbf{F}}\u27e7\leftarrow \mathbf{GS}-\mathbf{PCA}(\u27e6\mathbf{K}\u27e7,n,J)$; $\mathbf{5}\u27e6\mathbf{\lambda}\u27e7\leftarrow \u27e6\mathbf{\xi}[:n-1]\u27e7$; $\mathbf{6}\u27e6{\mathbf{U}}_{\mathbf{R}}\u27e7\leftarrow \u27e6{\mathbf{U}}_{\mathbf{F}}[:,:n-1]\u27e7$; $\mathbf{7}\u27e6\mathbf{\eta}\u27e7\leftarrow \u27e6{\mathbf{U}}_{\mathbf{R}}{\u27e7}^{T}\u27e6\mathbf{y}\u27e7$; $\mathbf{8}\u27e6\mathbf{\eta}\mathbf{2}\u27e7\leftarrow \u27e6\mathbf{\eta}\u27e7\ast \u27e6\mathbf{\eta}\u27e7$; $\mathbf{9}\u27e6\mathbf{\eta}\mathbf{2}\mathbf{Mat}\u27e7\leftarrow \u27e6\mathbf{\eta}\mathbf{2}\u27e7$ // $\mathbf{\eta}\mathbf{2}\mathbf{Mat}$ is a $101\times (n-1)$ matrix, each row is equal to $\mathbf{\eta}\mathbf{2}$; $\mathbf{10}\u27e6\mathbf{\lambda}\mathbf{Mat}\u27e7\leftarrow \u27e6\mathbf{\lambda}\u27e7$ // $\mathbf{\lambda}\mathbf{Mat}$ is a $101\times (n-1)$ matrix, each row is equal to $\mathbf{\lambda}$; $\mathbf{11}\mathbf{\delta}\leftarrow ({10}^{-5},\dots ,{10}^{5})$; $\mathbf{12}\mathbf{\delta}\mathbf{Mat}\leftarrow \mathbf{\delta}$ // $\mathbf{\delta}\mathbf{Mat}$ is a $101\times (n-1)$ matrix, each column is equal to $\mathbf{\delta}$; $\mathbf{13}\u27e6\Lambda \mathbf{p}\mathbf{\delta}\u27e7\leftarrow \u27e6\mathbf{\lambda}\mathbf{Mat}\u27e7+\mathbf{\delta}\mathbf{Mat}$; $\mathbf{14}\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7\leftarrow -(n-1)\mathbf{ln}\left(\mathbf{colSums}(\u27e6\mathbf{\eta}\mathbf{2}\mathbf{Mat}\u27e7/\u27e6\Lambda \mathbf{p}\mathbf{\delta}\u27e7)\right)-\mathbf{colSums}\left(\mathbf{ln}(\u27e6\Lambda \mathbf{p}\mathbf{\delta}\u27e7)\right)$; $\mathbf{15}\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7\leftarrow \u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7+\mathbf{\delta}\ast (\mathbf{max}(\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7)==\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7)$ // Adds a different positive constant to each maximum value in ${\mathbf{f}}_{\mathbf{pot}}$ to make sure that there is a single largest value in the vector, prevents leakage when using $\mathbf{cut}$ to find ${\delta}_{\mathrm{pot}}$; $\mathbf{16}\u27e6{\delta}_{\mathrm{pot}}\u27e7\leftarrow \mathbf{cut}(\u27e6\mathbf{\delta}\u27e7,\mathbf{max}(\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7)==\u27e6{\mathbf{f}}_{\mathbf{pot}}\u27e7)$; $\mathbf{17}\u27e6{\delta}_{max}\u27e7\leftarrow \mathbf{Newton}\u2014\mathbf{Rhapson}(\u27e6{\delta}_{\mathrm{pot}}\u27e7)$ //Newton–Rhapson algorithm; $\mathbf{18}\u27e6\mathbf{X}\mathbf{1}\u27e7\leftarrow \u27e6\mathbf{X}\u27e7$; $\mathbf{19}\u27e6\mathbf{diag}\u27e7\leftarrow 1/(\u27e6\mathbf{\xi}\u27e7+\u27e6{\delta}_{max}\u27e7)$; $\mathbf{20}\u27e6\mathbf{diagMat}\u27e7\leftarrow \u27e6\mathbf{diag}\u27e7$ // $\mathbf{diagMat}$ is a square matrix with $\mathbf{diag}$ at the diagonal; $\mathbf{21}\u27e6\mathbf{XH}\u27e7\leftarrow \u27e6\mathbf{X}\mathbf{1}\u27e7\u27e6{\mathbf{U}}_{\mathbf{F}}\u27e7\u27e6\mathbf{diagMat}\u27e7\u27e6{\mathbf{U}}_{\mathbf{F}}{\u27e7}^{T}$; $\mathbf{22}\u27e6\mathbf{diagXHX}\u27e7=\mathbf{rowSums}(\u27e6\mathbf{XH}\u27e7\ast \u27e6\mathbf{X}\mathbf{1}\u27e7)$; $\mathbf{23}\u27e6\mathbf{XHX}[:,0]\u27e7\leftarrow \u27e6\mathbf{diagXHX}\left[0\right]\u27e7$; $\mathbf{24}\u27e6\mathbf{XHX}[:,3]\u27e7\leftarrow \u27e6\mathbf{diagXHX}[1:]\u27e7$; $\mathbf{25}\u27e6\mathbf{XHX}[:,1]\u27e7\leftarrow \mathbf{rowSums}(\u27e6\mathbf{XH}\u27e7)$; $\mathbf{26}\u27e6\mathbf{XHX}[:,2]\u27e7\leftarrow \u27e6\mathbf{XHX}[:,1]\u27e7$; $\mathbf{27}\u27e6\mathbf{det}\u27e7\leftarrow \u27e6\mathbf{XHX}[:,0]\u27e7\ast \u27e6\mathbf{XHX}[:,3]\u27e7-\u27e6\mathbf{XHX}[:,1]\u27e7\ast \u27e6\mathbf{XHX}[:,2]\u27e7$; $\mathbf{28}\u27e6\mathbf{iXHX}[:,0]\u27e7\leftarrow \u27e6\mathbf{XHX}[:,3]\u27e7/\u27e6\mathbf{det}\u27e7$; $\mathbf{29}\u27e6\mathbf{iXHX}[:,3]\u27e7\leftarrow \u27e6\mathbf{XHX}[:,0]\u27e7/\u27e6\mathbf{det}\u27e7$; $\mathbf{30}\u27e6\mathbf{iXHX}[:,1]\u27e7\leftarrow -\u27e6\mathbf{XHX}[:,1]\u27e7/\u27e6\mathbf{det}\u27e7$; $\mathbf{31}\u27e6\mathbf{iXHX}[:,2]\u27e7\leftarrow \u27e6\mathbf{iXHX}[:,1]\u27e7$; $\mathbf{32}\u27e6\mathbf{yMat}\u27e7\leftarrow \u27e6\mathbf{y}\u27e7$ // $\mathbf{yMat}$ is an $m\times n$ matrix, each row is equal to $\mathbf{y}$; $\mathbf{33}\u27e6\mathbf{XHy}\u27e7\leftarrow \mathbf{rowSums}(\u27e6\mathbf{XH}\u27e7\ast \u27e6\mathbf{yMat}\u27e7)$; $\mathbf{34}\u27e6\mathbf{\beta}\u27e7\leftarrow \u27e6\mathbf{iXHX}[:,2]\u27e7\ast \u27e6\mathbf{XHy}\left[0\right]\u27e7+\u27e6\mathbf{iXHX}[:,3]\u27e7\ast \u27e6\mathbf{XHy}[1:]\u27e7$; $\mathbf{35}\u27e6\mathbf{yMat}\mathbf{2}\u27e7\leftarrow \u27e6\mathbf{y}\u27e7$ // $\mathbf{yMat}\mathbf{2}$ is an $n\times n$ matrix, each row is equal to $\mathbf{y}$; $\mathbf{36}\u27e6{\mathbf{yU}}_{\mathbf{F}}\u27e7\leftarrow \mathbf{rowSums}(\u27e6\mathbf{yMat}\mathbf{2}\u27e7\ast \u27e6{\mathbf{U}}_{\mathbf{F}}\u27e7)$; $\mathbf{37}\u27e6yHy\u27e7\leftarrow \mathbf{sum}(\u27e6{\mathbf{yU}}_{\mathbf{F}}\u27e7\ast \u27e6\mathbf{diag}\u27e7\ast \u27e6{\mathbf{yU}}_{\mathbf{F}}\u27e7)$; $\mathbf{38}\u27e6\mathbf{R}\mathbf{1}\u27e7\leftarrow (\u27e6\mathbf{iXHX}[:,0]\u27e7\ast \u27e6\mathbf{XHy}\left[0\right]\u27e7+\u27e6\mathbf{iXHX}[:,2]\u27e7\ast \u27e6\mathbf{XHy}[1:]\u27e7)\ast \u27e6\mathbf{XHy}\left[0\right]\u27e7$; $\mathbf{39}\u27e6\mathbf{R}\mathbf{2}\u27e7\leftarrow \u27e6\mathbf{\beta}\u27e7\ast \u27e6\mathbf{XHy}[1:]\u27e7$; $\mathbf{40}\mathbf{return}\phantom{\rule{3.33333pt}{0ex}}\u27e6\mathbf{t}\u27e7\leftarrow \u27e6\mathbf{\beta}\u27e7/\mathbf{sqrt}(\u27e6\mathbf{iXHX}[:,3]\u27e7\ast (\u27e6yHy\u27e7-(\u27e6\mathbf{R}\mathbf{1}\u27e7+\u27e6\mathbf{R}\mathbf{2}\u27e7)\left)\right)$; |

#### 3.3. Genomic Control

Algorithm 4: Privacy-preserving genomic control |

#### 3.4. Performance Measurements

## 4. Discussion

**Feasibility.**We showed how to adapt complex stratification control algorithms for genome-wide association studies to secure computing. We especially demonstrate the most complex PCA algorithms implemented in the literature on large datasets. In addition, we adapted the algorithms to trusted execution environments, showcasing how the side-channel security of the proposed algorithms can benefit both MPC and TEEs.

**Security.**Both secure multi-party computation and trusted execution environments can offer cryptographic security; however, both solutions have their advantages and drawbacks. For MPC, it requires the service providers to set up at least two servers. More complex algorithms might not be implementable or might not finish execution in this environment. The computations that do finish introduce a major computation and communication overhead. However, when the input party trusts their data to the computing parties, they do not need to trust one single party but can trust that the parties do not collaborate. Moreover, to be certain, an input party can become a computing party and thus gain more confidence.

**Performance.**It is clear that in the MPC setting, the running times of EMMAX are not feasible for real-world use for larger data volumes. It is a complex iterative algorithm that includes a lot of floating-point matrix multiplications. We fear that further optimisations would not yield a significant speed-up. However, genomic control shows that it is feasible to also carry out privacy-preserving GWAS with stratification control using MPC. The privacy-preserving EIGENSTRAT remains on the border of feasibility. However, for all four algorithms, we show that the computations can be run and be completed.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Hartl, D.L.; Clark, A.G. Principles of Population Genetics, 4th ed.; Sinauer Associates: Sunderland, MA, USA, 2006. [Google Scholar]
- Hellwege, J.N.; Keaton, J.M.; Giri, A.; Gao, X.; Velez Edwards, D.R.; Edwards, T.L. Population Stratification in Genetic Association Studies. Curr. Protoc. Hum. Genet.
**2017**, 95, 1.22.1–1.22.23. [Google Scholar] [CrossRef] [PubMed] - Campbell, C.D.; Ogburn, E.L.; Lunetta, K.L.; Lyon, H.N.; Freedman, M.L.; Groop, L.C.; Altshuler, D.; Ardlie, K.G.; Hirschhorn, J.N. Demonstrating stratification in a European American population. Nat. Genet.
**2005**, 37, 868. [Google Scholar] [CrossRef] [PubMed] - European Data Protection Board. Recommendations 01/2020 on Measures that Supplement Transfer Tools to Ensure Compliance with the EU Level of Protection of Personal Data. 2020. Available online: https://edpb.europa.eu/our-work-tools/public-consultations-art-704/2020/recommendations-012020-measures-supplement-transfer_en (accessed on 19 August 2021).
- European Data Protection Supervisor. Preliminary Opinion 8/2020 on the European Health Data Space. 2020. Available online: https://edps.europa.eu/data-protection/our-work/publications/opinions/preliminary-opinion-82020-european-health-data-space_en (accessed on 19 August 2021).
- Kamm, L.; Bogdanov, D.; Laur, S.; Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics
**2013**, 29, 886–893. [Google Scholar] [CrossRef] [PubMed][Green Version] - Constable, S.D.; Tang, Y.; Wang, S.; Jiang, X.; Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med. Inform. Decis. Mak.
**2015**, 15, S2. [Google Scholar] [CrossRef] [PubMed][Green Version] - Bogdanov, D.; Kamm, L.; Laur, S.; Sokk, V. Implementation and Evaluation of an Algorithm for Cryptographically Private Principal Component Analysis on Genomic Data. IEEE/ACM Trans. Comput. Biol. Bioinform.
**2018**, 15, 1427–1432. [Google Scholar] [CrossRef] - Bonte, C.; Makri, E.; Ardeshirdavani, A.; Simm, J.; Moreau, Y.; Vercauteren, F. Towards practical privacy-preserving genome-wide association study. BMC Bioinform.
**2018**, 19, 537. [Google Scholar] [CrossRef][Green Version] - Cho, H.; Wu, D.J.; Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol.
**2018**, 36, 547–551. [Google Scholar] [CrossRef] - Tkachenko, O.; Weinert, C.; Schneider, T.; Hamacher, K. Large-Scale Privacy-Preserving Statistical Computations for Distributed Genome-Wide Association Studies. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security (ASIACCS’18), Incheon, Korea, 4–8 June 2018; ACM: New York, NY, USA, 2018; pp. 221–235. [Google Scholar]
- Bellafqira, R.; Ludwig, T.E.; Niyitegeka, D.; Génin, E.; Coatrieux, G. Privacy-Preserving Genome-Wide Association Study for Rare Mutations—A Secure FrameWork for Externalized Statistical Analysis. IEEE Access
**2020**, 8, 112515–112529. [Google Scholar] [CrossRef] - Poddar, R.; Kalra, S.; Yanai, A.; Deng, R.; Popa, R.A.; Hellerstein, J.M. Senate: A Maliciously-Secure MPC Platform for Collaborative Analytics. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online, 11–13 August 2021; USENIX Association: Vancouver, BC, Canada, 2021. [Google Scholar]
- Zhang, Y.; Dai, W.; Jiang, X.; Xiong, H.; Wang, S. FORESEE: Fully Outsourced secuRe gEnome Study basEd on homomorphic Encryption. BMC Med. Inform. Decis. Mak.
**2015**, 15, S5. [Google Scholar] [CrossRef][Green Version] - Wang, S.; Zhang, Y.; Dai, W.; Lauter, K.E.; Kim, M.; Tang, Y.; Xiong, H.; Jiang, X. HEALER: Homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics
**2016**, 32, 211–218. [Google Scholar] [CrossRef][Green Version] - Chen, F.; Yang, H.; Kim, J.; Ohno-Machado, L.; Ding, S.; Jiang, X.; Wang, S.; Fox, D.; Lauter, K.; Lu, Y.; et al. PRINCESS: Privacy-protecting Rare disease International Network Collaboration via Encryption through Software guard extensionS. Bioinformatics
**2016**, 33, 871–878. [Google Scholar] [CrossRef][Green Version] - Asvadishirehjini, A.; Kantarcioglu, M.; Malin, B. A Framework for Privacy-Preserving Genomic Data Analysis Using Trusted Execution Environments. In Proceedings of the 2020 Second IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), Atlanta, GA, USA, 28–31 October 2020; pp. 138–147. [Google Scholar]
- Kockan, C.; Zhu, K.; Dokmai, N.; Karpov, N.; Kulekci, M.O.; Woodruff, D.P.; Sahinalp, S.C. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat. Methods
**2020**, 17, 295–301. [Google Scholar] [CrossRef] - Pascoal, T.; Decouchant, J.; Boutet, A.; Veríssimo, P. DyPS: Dynamic, Private and Secure GWAS. In Proceedings of the Privacy Enhancing Technologies (PoPETS), Online, 12–16 July 2021; Volume 2, pp. 214–234. [Google Scholar]
- Price, A.L.; Patterson, N.J.; Plenge, R.M.; Weinblatt, M.E.; Shadick, N.A.; Reich, D. Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nat. Genet.
**2006**, 38, 904–909. [Google Scholar] [CrossRef] [PubMed] - Simmons, S.; Sahinalp, C.; Berger, B. Enabling Privacy-Preserving GWASs in Heterogeneous Human Populations. Cell Syst.
**2016**, 3, 54–61. [Google Scholar] [CrossRef] [PubMed][Green Version] - Mittos, A.; Malin, B.; Cristofaro, E.D. Systematizing Genomic Privacy Research—A Critical Analysis. arXiv
**2017**, arXiv:abs/1712.02193. [Google Scholar] - Galinsky, K.J.; Bhatia, G.; Loh, P.R.; Georgiev, S.; Mukherjee, S.; Patterson, N.J.; Price, A.L. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet.
**2016**, 98, 456–472. [Google Scholar] [CrossRef][Green Version] - Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.Y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet.
**2010**, 42, 348–354. [Google Scholar] [CrossRef][Green Version] - Devlin, B.; Roeder, K. Genomic Control for Association Studies. Biometrics
**1999**, 55, 997–1004. [Google Scholar] [CrossRef] [PubMed] - Bogdanov, D. Sharemind: Programmable Secure Computations with Practical Applications. Ph.D. Thesis, University of Tartu, Tartu, Estonia, 2013. [Google Scholar]
- Cramer, R.; Damgård, I.; Nielsen, J. Secure Multiparty Computation and Secret Sharing; Cambridge University Press: New York, NY, USA, 2015. [Google Scholar]
- Archer, D.W.; Bogdanov, D.; Lindell, Y.; Kamm, L.; Nielsen, K.; Pagter, J.I.; Smart, N.P.; Wright, R.N. From Keys to Databases—Real-World Applications of Secure Multi-Party Computation. Comput. J.
**2018**, 61, 1749–1771. [Google Scholar] [CrossRef][Green Version] - Hastings, M.; Hemenway, B.; Noble, D.; Zdancewic, S. SoK: General Purpose Compilers for Secure Multi-Party Computation. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 1220–1237. [Google Scholar]
- Randmets, J. An Overview of Vulnerabilities and Mitigations of Intel SGX Applications; Technical Report D-2-116; Cybernetica AS: Tallinn, Estonia, 2021. [Google Scholar]
- Bogdanov, D.; Kamm, L.; Laur, S.; Sokk, V. Rmind: A tool for cryptographically secure statistical analysis. IEEE Trans. Depend. Secur. Comput.
**2016**, 15, 481–495. [Google Scholar] [CrossRef][Green Version] - Bogdanov, D.; Laur, S.; Talviste, R. A Practical Analysis of Oblivious Sorting Algorithms for Secure Multi-party Computation. In Proceedings of the 19th Nordic Conference on Secure IT Systems (NordSec 2014), Tromsø, Norway, 15–17 October 2014; LNCS; Springer: Cham, Switzerland, 2014; Volume 8788, pp. 59–74. [Google Scholar]
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; John Hopkins University Press: Baltimore, MD, USA, 2013. [Google Scholar]
- Laud, P.; Randmets, J. A Domain-Specific Language for Low-Level Secure Multiparty Computation Protocols. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (ACM 2015), Denver, CO, USA, 12–16 October 2015; pp. 1492–1503. [Google Scholar]
- Randmets, J. Programming Languages for Secure Multi-Party Computation Application Development. Ph.D. Thesis, University of Tartu, Tartu, Estonia, 2017. [Google Scholar]
- Bogdanov, D.; Niitsoo, M.; Toft, T.; Willemson, J. High-performance secure multi-party computation for data mining applications. Int. J. Inf. Secur.
**2012**, 11, 403–418. [Google Scholar] [CrossRef]

Donor ID | ${SNP}_{1}$ | ${SNP}_{2}$ | … | ${SNP}_{m}$ | Phenotype Vector y |
---|---|---|---|---|---|

${D}_{1}$ | 0 | 1 | … | 0 | 0 |

${D}_{2}$ | 1 | 0 | … | 0 | 1 |

… | … | … | … | … | … |

${D}_{n}$ | 2 | 0 | … | 2 | 1 |

Group | $AA$ | $AB$ | $BB$ | Total |
---|---|---|---|---|

Case | ${r}_{0}$ | ${r}_{1}$ | ${r}_{2}$ | R |

Control | ${s}_{0}$ | ${s}_{1}$ | ${s}_{2}$ | S |

Total | ${n}_{0}$ | ${n}_{1}$ | ${n}_{2}$ | N |

Subtask | 1500 SNPs | 2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 2000 SNPs |
---|---|---|---|---|---|---|

217 Donors | 217 Donors | 217 Donors | 217 Donors | 434 Donors | 868 Donors | |

Table preparation | 67.4 s | 85.4 s | 218.4 s | 865.1 s | 161.6 s | 347.0 s |

GS-PCA | 182.8 s | 187.9 s | 183.3 s | 704.8 s | 685.8 s | 2606.0 s |

Stratification control | 1365.9 s | 1849.5 s | 4520.2 s | 18,684.3 s | 6214.4 s | 22,112.0 s |

Test statistics | 136.2 s | 174.5 s | 436.6 s | 1681.7 s | 344.2 s | 675.4 s |

Total | 1752.4 s | 2297.2 s | 5358.4 s | 21,413.2 s | 7406.0 s | 25,740.4 s |

(29.2 min) | (38.3 min) | (89.3 min) | (5.95 h) | (2.06 h) | (7.15 h) |

Subtask | 1500 SNPs | 2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 2000 SNPs |
---|---|---|---|---|---|---|

217 Donors | 217 Donors | 217 Donors | 217 Donors | 434 Donors | 868 Donors | |

Table preparation | 75.7 s | 102.1 s | 247.9 s | 1000.8 s | 198.1 s | 393.9 s |

PCA using random | 2482.0 s | 3290.9 s | 7920.7 s | 31,355.6 s | 6194.8 s | 11,900.2 s |

matrix theory | ||||||

Stratification control | 216.3 s | 293.7 s | 707.3 s | 2800.8 s | 564.9 s | 1139.3 s |

Test statistics | 159.1 s | 204.9 s | 505.6 s | 1998.5 s | 407.9 s | 815.6 s |

Total | 2933.0 s | 3891.5 s | 9381.5 s | 37,155.7 s | 7365.7 s | 14,249.0 s |

(48.9 min) | (64.9 min) | (2.61 h) | (10.32 h) | (2.04 h) | (3.96 h) |

Subtask | 1000 SNPs | 5000 SNPs | 20,000 SNPs | 1000 SNPs | 1000 SNPs |
---|---|---|---|---|---|

217 Donors | 217 Donors | 217 Donors | 100 Donors | 434 Donors | |

Table preparation | 44.7 s | 239.3 s | 891.0 s | 24.7 s | 89.8 s |

Kinship matrix | 700.0 s | 3420.2 s | 13,988.5 s | 173.3 s | 2844.0 s |

GS-PCA | 14,241.6 s | 14,209.2 s | 14,239.3 s | 1912.8 s | 104,067.3 s |

Maximum likelihood | 4934.3 s | 21,381.8 s | 81,961.9 s | 1147.5 s | 23,185.7 s |

Test statistics | 0.4 s | 1.7 s | 7.6 s | 0.4 s | 0.4 s |

Total | 19,920.9 s | 39,252.5 s | 111,088.3 s | 3258.7 s | 130,187.1 s |

(5.53 h) | (10.90 h) | (30.86 h) | (54.3 min) | (36.2 h) |

Subtask | 5000 SNPs | 300,000 SNPs | 300,000 SNPs | 300,000 SNPs |
---|---|---|---|---|

217 Donors | 217 Donors | 434 Donors | 661 Donors | |

Table preparation | 33.8 s | 1912.3 s | 3619.5 s | 5451.1 s |

Cochran–Armitage test | 0.3 s | 24.1 s | 23.5 s | 23.8 s |

Stratification control | 0.9 s | 83.3 s | 83.4 s | 83.6 s |

Total | 34.9 s | 2019.7 s | 3726.4 s | 5558.5 s |

(0.6 min) | (33.7 min) | (62.1 min) | (92.6 min) |

Subtask | EIGENSTRAT | EMMAX | ||||
---|---|---|---|---|---|---|

2000 SNPs | 5000 SNPs | 20,000 SNPs | 2000 SNPs | 1000 SNPs | 5000 SNPs | |

217 Donors | 217 Donors | 217 Donors | 868 Donors | 217 Donors | 217 Donors | |

Table preparation | 101 ms | 253 ms | 1600 ms | 272 ms | 28 ms | 215 ms |

Kinship matrix | 219 ms | 606 ms | 2642 ms | 3552 ms | 111 ms | 609 ms |

GS-PCA | 46 ms | 53 ms | 69 ms | 735 ms | 4202 ms | 4220 ms |

Stratification control/ | 443 ms | 1337 ms | 15675 ms | 2782 ms | ||

Maximum likelihood | 26 ms | 25 ms | ||||

Test statistics | 74 ms | 175 ms | 2489 ms | 442 ms | 409 ms | 2076 ms |

Total | 883 ms | 2424 ms | 22475 ms | 7783 ms | 4776 ms | 7145 ms |

**Table 8.**Comparison of the Sharemind MPC and HI implementations of privacy-preserving GWAS algorithms with stratification control.

Experiment | Data Size | MPC | HI | If HI Was 100 Times Slower |
---|---|---|---|---|

EIGENSTRAT | 5000 SNPs | 5358.4 s | 2.42 s | 242 s |

200 donors | (89.3 min) | |||

EMMAX | 5000 SNPs | 87,400 s | 7.14 s | 714 s |

200 donors | (24.3 h) | |||

Genomic control | 300,000 SNPs | 2096.5 s | 15.11 s | ∼20 s |

200 donors | (35 min) | (reading input data) | (reading input data) |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ostrak, A.; Randmets, J.; Sokk, V.; Laur, S.; Kamm, L. Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. *Cryptography* **2021**, *5*, 21.
https://doi.org/10.3390/cryptography5030021

**AMA Style**

Ostrak A, Randmets J, Sokk V, Laur S, Kamm L. Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification. *Cryptography*. 2021; 5(3):21.
https://doi.org/10.3390/cryptography5030021

**Chicago/Turabian Style**

Ostrak, Andre, Jaak Randmets, Ville Sokk, Sven Laur, and Liina Kamm. 2021. "Implementing Privacy-Preserving Genotype Analysis with Consideration for Population Stratification" *Cryptography* 5, no. 3: 21.
https://doi.org/10.3390/cryptography5030021