Future Preventive Gene Therapy of Polygenic Diseases from a Population Genetics Perspective

With the accumulation of scientific knowledge of the genetic causes of common diseases and continuous advancement of gene-editing technologies, gene therapies to prevent polygenic diseases may soon become possible. This study endeavored to assess population genetics consequences of such therapies. Computer simulations were used to evaluate the heterogeneity in causal alleles for polygenic diseases that could exist among geographically distinct populations. The results show that although heterogeneity would not be easily detectable by epidemiological studies following population admixture, even significant heterogeneity would not impede the outcomes of preventive gene therapies. Preventive gene therapies designed to correct causal alleles to a naturally-occurring neutral state of nucleotides would lower the prevalence of polygenic early- to middle-age-onset diseases in proportion to the decreased population relative risk attributable to the edited alleles. The outcome would manifest differently for late-onset diseases, for which the therapies would result in a delayed disease onset and decreased lifetime risk; however, the lifetime risk would increase again with prolonging population life expectancy, which is a likely consequence of such therapies. If the preventive heritable gene therapies were to be applied on a large scale, the decreasing frequency of risk alleles in populations would reduce the disease risk or delay the age of onset, even with a fraction of the population receiving such therapies. With ongoing population admixture, all groups would benefit over generations.


Introduction
Research into the causality and liability of diseases, based primarily on familial and populational observations, long predates the discovery of the structure of DNA by Watson and Crick in 1953 [1]. Initially, it was only possible to estimate the frequency of highly malignant mutations in human populations [2]. It took several decades for experimental techniques to develop sufficiently to sequence the human genome [3]. Whole genome sequencing (WGS) and genome-wide association studies (GWASs) have provided experimental insights into the genetic architecture of polygenic diseases that could be only hypothesized a decade or two earlier [4].
The search for singular genetic mutations started decades ago and continued with GWASs and WGS, which led to the discovery of many thousands of highly malignant so-called Mendelian conditions. Among such conditions are sickle-cell anemia, Tay-Sachs disease, cystic fibrosis, hemophilia, thalassemia, Huntington disease, early-onset Alzheimer's disease, and macular degeneration, as well as mutations in the BRCA1/2 genes, which are causally linked to multiple types of cancer, especially breast cancer [5]. On its own, the prevalence of each such disease in the population is relatively low. The mutations that cause the majority of Mendelian conditions are known and usually involve single nucleotide variants (SNVs) that are associated with a high susceptibility to these diseases, with other sequence rearrangements representing an aggregate 13% of mutations [6,7]. The OMIM Gene Map Statistics [5] database lists over 4000 of such gene mutations responsible for almost 6500 phenotypic conditions or syndromes, and The Human Genome Mutations Database [8] lists more than 250,000 disease-causing mutations. It has been estimated that, on average, an individual carries 0.58 recessive alleles that can lead to complete sterility or death by reproductive age when homozygous [9]. The fact that this number is an average of a large variety of very rare mutations distributed throughout the genome indicates that severe events, which occur when these rare alleles affect a particular gene pair in one descendant, are an infrequent occurrence. However, in aggregate, less malignant diseases caused by rare mutations affect a noticeable fraction of the population, with approximately 8% of individuals affected [7,10].
Many experimental gene therapy techniques have been tested that target diseases typically caused by a single defective gene or SNV. Ginn et al. [11] identified 287 trials on inherited monogenic disorders that had been performed by the end of 2017, with the overall number of clinical trials of gene therapies, predominantly in the oncology field, exceeding 2600. Philippidis [12] summarized 25 gene-editing therapies that were under clinical trial during the first quarter of 2019. All therapies in these studies focused exclusively on the clinical, or reactive rather than prophylactic, treatment of genetic conditions. Although not yet technologically or medically possible, the potential of applying germline gene-editing therapy to prevent at least some of these diseases is being increasingly discussed. Public understanding of the expected health benefits of such therapies is gradually building [13,14], and is notably present in the recommendations of the UK Nuffield Council on Bioethics report Genome editing and human reproduction: Social and ethical issues (2018) [15]. Hypothetically, when the medical technology becomes available to safely and accessibly correct these mutations, and if governmental regulations allow it in the future [16], treated individuals and their descendants (in cases of heritable gene therapies) will be effectively cured and have no need for concern about the single specific cause of their disease.
In contrast to Mendelian conditions, polygenic or complex disease liability is attributed to hundreds to thousands of gene variants or single nucleotide polymorphisms (SNPs) of typically small effect that, in combination, constitute the polygenic disease risk of an individual [17][18][19]. The polygenic risk score (PRS) of an individual at higher risk for a polygenic disease reflects the presence of a higher number of detrimental gene variants [20] relative to the average distribution of common gene variants in the population. Polygenic diseases include highly prevalent old-age diseases, termed late-onset diseases (LODs), that eventually affect most individuals (for example, cardiovascular disease, particularly coronary artery disease, cerebral stroke, type 2 diabetes, senile dementia, Alzheimer's disease, cancers, and osteoarthritis) [21][22][23][24][25][26][27][28], as well as earlier-onset diseases and phenotypic features such as susceptibility to asthma and psychiatric disorders and particular height and high body mass index (BMI) characteristics [4]. Over the past ten years, GWAS results have been reported for hundreds of complex traits across a wide range of phenotypes. These studies have led to a well-established consensus that a large number of common low-effect variants can explain the heritability of the majority of complex traits and diseases [4,29,30]. With increasing cohort sizes and improving analysis methods, GWASs are finding ever larger sets of SNPs associated with polygenic traits. GWASs can still explain only a fraction of disease heritability; however, the systematically collected SNP correlations provide a good indication of the expected effect sizes and allele frequency distribution of as yet undiscovered SNPs [18]. Research provides strong support for multiplicative effects of common SNPs and their interaction with the environment [31,32]. According to Chatterjee et al. [33], "to date, post-GWAS epidemiological studies of gene-environment interactions have generally reported multiplicative joint associations between low-penetrant SNPs and environmental risk factors, with only a few exceptions," and "investigations of SNP-by-SNP and SNP-by-environment interactions using data from large GWAS generally suggest that the assumption of multiplicative effects is often adequate."
Geographic and local population genetic stratification and variation complicate the ability to diagnose and treat medical conditions [34] (for additional exposition, see Appendix A.1). The predictive utility of GWAS and GWAS PRSs also varies broadly if the risk score is applied to a population other than the one for which the score was initially determined [35][36][37]. At the same time, there are many indications of the commonality of causal gene variants for polygenic diseases among geographically distinct populations [38,39], while admixed populations present an intermediate liability to diseases [40][41][42]. A study by Zanetti and Weale [43] found that a combination of Euro-centric SNP selection and between-population differences in linkage disequilibrium and effect allele frequencies was sufficient to explain the rate of previously reported trans-ethnic differences, without the need to assume between-population differences in the true causal SNP effect size, suggesting that the cross-population consistency is larger than that usually reported.
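The multiplicative model described above can be illustrated with a minimal sketch in which the PRS is the sum of log(RR) over the risk alleles an individual carries, so that individual relative risks multiply. The function names and the three-SNP genotype are hypothetical and chosen only for illustration; they are not taken from the study's simulation code.

```python
import math

def polygenic_risk_score(allele_counts, relative_risks):
    """PRS as the sum of log(RR) over carried risk alleles.

    allele_counts: copies of the risk allele at each SNP (0, 1 or 2).
    relative_risks: per-allele relative risk at each SNP.
    """
    return sum(n * math.log(rr) for n, rr in zip(allele_counts, relative_risks))

def relative_risk(prs):
    """Multiplicative model: risk relative to a carrier of no risk alleles."""
    return math.exp(prs)

# Hypothetical genotype: three low-effect SNPs (illustrative values only)
counts = [2, 1, 0]
per_allele_rr = [1.1, 1.05, 1.2]

prs = polygenic_risk_score(counts, per_allele_rr)
print(f"PRS = {prs:.4f}, relative risk = {relative_risk(prs):.4f}")
```

Summing logarithms and then exponentiating is equivalent to multiplying 1.1 × 1.1 × 1.05 directly, which is why a PRS expressed in units of log(RR) averages arithmetically while the corresponding risks combine multiplicatively.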
Even when the majority of causal gene variants are common among populations, they are difficult to match precisely in genetically stratified populations for two main reasons. First, the GWAS PRS is composed of representative so-called "tag" SNPs. Rather than being true causal variants, tag SNPs are from a genomic region that exerts a single or combined effect of multiple detrimental and protective SNPs in various degrees of linkage disequilibrium and varying allele frequencies in different subpopulations [44,45] (see Figure 1A). Thus, although only a small fraction of true causal SNPs for each polygenic condition have been identified, PRSs can be determined since they rely on an aggregate of implicit determinations that are likely to significantly differ among the population-specific background of non-causal SNPs [44]. The second reason that underlies this challenge is that, in addition to differences in SNPs, there are less-researched structural variations that differ among populations and can influence disease liability [46]. Major projects are underway that aim to comprehensively catalog the detrimental structural variation in diverse populations [47]. In parallel, the advancement of biomedical techniques will facilitate the detection of germline structural variants for clinical validation and research in the future [48].
For LODs, a combination of genetic liability, environmental factors, and the physiological decline of multiple organ systems leads to individual disease presentations [27]. Earlier research evaluated the risk allele distributions that accompany aging for polygenic LODs [49], and, leveraging age-specific incidence rates under Cox's proportional hazards model [33,50], quantified the potential of future preventive gene therapies to delay the onset age and reduce the lifetime risk of such LODs [51]. This is demonstrated in Figure 1B. A recent clinical data analysis confirmed these theoretical predictions [52].
The polygenic diseases with the highest incidence in early and middle age, which are the focus of the current research, are exemplified by asthma [53,54], chronic migraine [55,56], Dupuytren's disease [57], rheumatoid arthritis [58], lupus erythematosus [59], schizophrenia and bipolar disorder [60], and Crohn's disease [61,62]. The lower prevalence of these diseases contrasts with the high prevalence of some LODs, highlighting differences in their evolutionary and causal manifestations [63]. These diseases are less suitable for the age-specific rates approach [51] because subjects with an earlier age at disease onset do not necessarily show an increased polygenic risk burden, as exemplified by schizophrenia incidence [64]. The liability to these diseases is often illustrated using the liability threshold model proposed by Falconer [65] (see Figure 1C,D).
In this study, computer simulations were used to evaluate the magnitude of the heterogeneity in alleles causal for polygenic diseases that could exist among geographically distinct populations. Population genetics simulations were performed for representative scenarios of preventive gene therapies designed to turn true causal alleles into a naturally existing neutral state of nucleotides for polygenic Early- to Middle-age-Onset Diseases (EMODs); these simulations evaluated the disease prevalence reduction and the progression of population admixture that would accompany such therapies. The combination of these EMOD findings with earlier published LOD conclusions resulted in a comprehensive picture of preventive polygenic disease gene therapy from a population genetics perspective.
Figure 1 (partial caption). [...] [65,66]. Under this model, the disease prevalence is a function of the disease liability (as termed by Falconer), which can be understood as the polygenic risk score of true causal gene variants. For Population B, the area to the right of the liability threshold is larger, as is the disease prevalence; the vertical liability threshold line is the initial Falconer interpretation for illustration purposes. Modern approaches can be perused in [67]; (D) Falconer's liability threshold model with the same mean liability and different liability variances. If both distributions are normalized, the prevalence will be larger for a wider variance, an effect that is most distinct for the smallest prevalence values, and it will remain identical between populations A and B at a prevalence of 50%.
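The behavior of the liability threshold model in Figure 1C,D can be sketched numerically. The example below assumes a normally distributed liability and a fixed threshold, and the mean shift of 0.5 and the 10% wider standard deviation are illustrative values, not parameters from the study. It uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

def prevalence(threshold, mean=0.0, sd=1.0):
    """Fraction of the population whose liability exceeds the threshold."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)

reference = NormalDist()                 # Population A: mean 0, sd 1
t_rare = reference.inv_cdf(0.99)         # threshold giving 1% prevalence in A
t_half = reference.inv_cdf(0.50)         # threshold giving 50% prevalence in A

# Figure 1C-like case: same variance, higher mean liability (illustrative shift)
shifted_mean = prevalence(t_rare, mean=0.5)

# Figure 1D-like case: same mean, wider liability distribution (illustrative sd)
wider_rare = prevalence(t_rare, sd=1.1)
wider_half = prevalence(t_half, sd=1.1)

print(f"baseline rare-disease prevalence: {prevalence(t_rare):.4f}")
print(f"higher mean liability:            {shifted_mean:.4f}")
print(f"wider variance, rare disease:     {wider_rare:.4f}")
print(f"wider variance, 50% prevalence:   {wider_half:.4f}")
```

With the threshold placed at the mean, widening the distribution leaves exactly half of it above the threshold, whereas a threshold far in the tail captures a larger share of a wider distribution.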

Admixture of Populations with Matching Mean PRSs: To What Extent Can Causal Risk Alleles of Polygenic Diseases Differ between Populations?
The first set of simulations evaluated the blending admixture of two simulated populations with equal liability to a disease. The disease heritability was set at 50%, the mid-range heritability of polygenic diseases [68,69]. The disease SNP sets were built using the common low-effect genetic architecture, and the population genetics simulation progressed through generations. Four simulated scenarios, in which the combined effect of SNPs differed between the populations by 100%, 65%, 33%, and 20%, were considered.
The simulations recorded the changes in the variance of the population PRS and disease prevalence as generations progressed. The simulated diseases were polygenic EMODs, which are model polygenic diseases whose maximum incidence occurs at young to middle age, with a negligible incidence at older ages. In this publication, the term "prevalence", used in reference to EMODs, always means the prevalence at an age later than the typical age of onset range.
The results presented in Figure 2 show that, for all scenarios of differing SNP architectures, the PRS variance gradually increased starting from the second admixed generation, and it continued to increase in subsequent generations. The consequences of this pattern are illustrated in Figure 1D. The variance rise was gradual: by the fifth generation, it had produced a 3% increase in prevalence in the scenario in which all causal SNPs differed between the populations, and an increase of just a fraction of a percent in the scenario in which one-fifth of causal SNPs differed. By the 25th generation, the prevalence values for the scenarios with the highest and lowest differences in causal genetic architecture were 1.12% and 1.03%, respectively, or, in relative terms, 12% and 3% above the prevalence in the populations before admixture. These results are summarized in Table 1. The gradual increases in variance and prevalence were due to gradual recombination of the population genome. Figure 2B,D show the result of accelerating the recombination to 1000 crossovers per genome per generation. In these panels, the population risk variance and prevalence approach the equilibrium within a few generations. This increase in variance with the admixture of diverse populations was previously reported with much smaller magnitudes of causal allele stratification based on actual allele frequencies in human populations [70].
This phenomenon can be explained in simplified form using an example of two risk alleles, each unique to one of two identically sized populations with identical disease risk. When these populations blend together, the frequency of each risk allele is expected to average out in the resulting population, with the resulting average effect size, or PRS, remaining unchanged. At the same time, following Equation (A1), the sum of the variances will increase relative to each initial population.
As illustrated in Figure 1D, this will cause the risk probability distribution to widen, leading to an increase in the risk of low-prevalence diseases, with no change at all for diseases with a prevalence of 50%.
Table 1 note: The simulated values are for a disease with early- to middle-age onset, 50% heritability, and a 1% prevalence/lifetime risk. The relative prevalence increase is calculated in comparison to the baseline prevalence, where, for example, the prevalence increase from 1% to 1.45% represents a 45% relative increase.
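The two-allele explanation given above can be checked with a short calculation. This is a sketch under stated assumptions rather than a reproduction of Equation (A1): each locus is treated as biallelic and in Hardy-Weinberg equilibrium, the loci are independent, and the blended population is taken at its equilibrium state. The allele frequency 0.3 and effect size 0.1 are arbitrary illustrative values.

```python
def locus_mean_var(p, beta):
    """Mean and variance of one locus's PRS contribution.

    The risk-allele count is Binomial(2, p) under Hardy-Weinberg equilibrium,
    and each copy adds beta (in units of log relative risk).
    """
    return 2 * p * beta, 2 * p * (1 - p) * beta ** 2

def prs_mean_var(freqs, beta):
    """Sum the per-locus means and variances (loci assumed independent)."""
    parts = [locus_mean_var(p, beta) for p in freqs]
    return sum(m for m, _ in parts), sum(v for _, v in parts)

p, beta = 0.3, 0.1                             # illustrative values
pop1 = prs_mean_var([p, 0.0], beta)            # allele 1 present only in Population 1
pop2 = prs_mean_var([0.0, p], beta)            # allele 2 present only in Population 2
blended = prs_mean_var([p / 2, p / 2], beta)   # equal-size blend: both at p/2

print("mean PRS:", pop1[0], pop2[0], blended[0])
print("variance:", pop1[1], pop2[1], blended[1])
print("variance ratio blended/original:", blended[1] / pop1[1])
```

The mean PRS is unchanged by blending, while the variance grows by the factor (1 − p/2)/(1 − p), because two alleles at half the frequency contribute more variance than one allele at the full frequency.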

The results of these simulations suggest that the true causal SNPs of polygenic diseases may easily differ by more than 30%, perhaps even by up to 100%, between geographically stratified populations, and clinical or epidemiological observations will be unlikely to register small and gradual increases in disease prevalence over successive generations because of the increase in the combined variance of a large number of risk alleles. A simulation of accelerated recombination, with 1000 recombinations per parental generation genome, resulted in the equilibrium level being reached within a few generations with a maximum relative prevalence increase by 45% when all SNPs differed between the original populations and 6.3% relative increase when one-fifth of the SNPs differed. However, it would take many generations to reach this equilibrium in real populations, and, on such a timescale, this process is likely to be indistinguishable in clinical practice from ongoing admixture with other populations and confounded by genetic drift, mutations, selection, stratification, environmental, and lifestyle changes.

Admixture of Populations with Differing PRSs
This scenario evaluated the admixture of two populations with similar polygenic EMOD architectures, where the higher-risk Population 2 was characterized by a common frequency of a small subset of alleles that had a very low frequency in Population 1, giving Population 2 an average relative risk (RR) of 10.0 (PRS difference in units of log(RR) = 2.30), as displayed in Figure 3. Accordingly, the initial disease prevalence was equal to 0.1% for Population 1 and 1% for Population 2. As expected from the conclusions of the preceding section, the relative PRS variance between the two initial populations before admixture differed by just 1.1%, even with the 10-fold difference in disease risk between the populations. This population liability is almost exactly reflected in Figure 1C, but not Figure 1D. The PRS effect size after admixture settled at the average between the two original populations, as is typical of the observational reports cited in the Introduction. The variance level of the combined population stabilized closer to the variance of the higher-risk Population 2, as would be expected from Equation (A1), with a negligible effect on the disease prevalence. Figure 3A,B show that the normalized PRS effect size difference between the populations accounted by the simulation almost exactly follows the proportion of population mixing under all admixture scenarios. This behavior matches the reported polygenic disease risk averaged in proportion to the population admixture noted in the publications referenced in the Introduction and Appendix A.1. While the admixture of two equally sized populations results in a precisely averaged PRS, the prevalence after mixing is close to the geometric mean of the initial prevalence values, resulting in a smaller-than-arithmetic average of the prevalence values of the initial populations. Thus, in this example, the prevalence is 0.32% rather than 0.55% (see Figure 3C,D). 
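The geometric-mean behavior noted above follows from averaging the PRS in log(RR) units. The sketch below assumes that, at low prevalence, prevalence scales approximately as the exponential of the mean PRS; this is a simplifying assumption for illustration, and the function name is hypothetical.

```python
import math

def admixed_prevalence_low_risk(k1, k2):
    """Approximate prevalence after equal-size admixture.

    Assumes prevalence is proportional to exp(mean PRS), so averaging the
    PRS arithmetically averages the logarithms of the two prevalences.
    """
    mean_log = 0.5 * (math.log(k1) + math.log(k2))
    return math.exp(mean_log)

k1, k2 = 0.001, 0.01          # 0.1% and 1%, the initial values in this scenario
geometric = admixed_prevalence_low_risk(k1, k2)
arithmetic = 0.5 * (k1 + k2)

print(f"geometric mean:  {geometric:.4%}")
print(f"arithmetic mean: {arithmetic:.4%}")
```

The geometric mean of 0.1% and 1% is about 0.32%, in line with the simulated value quoted above, whereas simple averaging of the prevalences would give 0.55%.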
The PRSs will generally equalize following a simple mixing equation; this is true for both EMODs and LODs:
$$\beta_2(g+1) = (1 - 0.5\,m)\,\beta_2(g) + 0.5\,m\,\beta_1(g) \qquad (1)$$
In the calculation for Population 2, $\beta_2(g+1)$ is the effect size (PRS) of Population 2 in generation g + 1, computed from the values of the previous generation g. In this case, 0.5 is the ratio for equal population sizes, m is the admixture proportion, and $\beta_1$ is the effect size of Population 1; the equation for Population 1 mirrors Equation (1).
Figure 3 (partial caption). [...] (2) shows Population 2. Population 1, which was used as the reference, had a mean PRS of 0.00 and an initial prevalence of 0.1%. Population 2 is the higher-risk population and had an initial PRS of 2.30 and an initial prevalence of 1%. The plots (C,D) show the corresponding disease prevalence change. Figure A2 shows a graphical display from the simulation and illustrates the admixture between these two populations.
It is interesting to note that, even with a low 10% population admixture rate, the prevalence in the higher-risk Population 2 decreases, relative to its baseline, to 91% in one generation, 84% in two generations, and 77% in three generations. The improvement is even faster at higher admixture rates, with both populations heading toward an asymptotic admixed prevalence of 0.32%, or 32% of the Population 2 baseline. Notably, the equalization is reached in one generation in the 100% blending scenario (shown by the red lines in Figure 3).
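The PRS equalization described in this section can be sketched as a simple iteration. The update rule used here, β2(g+1) = β2(g) − 0.5·m·(β2(g) − β1(g)) with the mirrored rule for Population 1, is an assumption consistent with the definitions given for Equation (1) (0.5 for equal population sizes, m as the admixture proportion); the function names are hypothetical. Only the mean PRS is tracked, so the sketch does not reproduce the simulated prevalence values, which also depend on the PRS variance.

```python
def admix_step(beta1, beta2, m, share=0.5):
    """One generation of symmetric mixing between two equal-size populations."""
    exchange = share * m * (beta2 - beta1)
    return beta1 + exchange, beta2 - exchange

def admix(beta1, beta2, m, generations):
    """Return the list of (beta1, beta2) pairs for generations 0..generations."""
    history = [(beta1, beta2)]
    for _ in range(generations):
        beta1, beta2 = admix_step(beta1, beta2, m)
        history.append((beta1, beta2))
    return history

# Population 1: mean PRS 0.00; Population 2: mean PRS 2.30 (values from the text)
for m in (0.1, 0.5, 1.0):
    trajectory = admix(0.0, 2.30, m, generations=3)
    gaps = [round(b2 - b1, 4) for b1, b2 in trajectory]
    print(f"m = {m}: PRS gap by generation = {gaps}")
```

Under this rule the PRS gap between the populations shrinks by the factor (1 − m) each generation, the combined mean is conserved, and m = 1 closes the gap in a single generation.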

Lowering Polygenic Disease Prevalence by Editing Effect SNPs
The gene therapy operations would change the frequency of detrimental SNPs in some fraction of a population. Population-wide Hardy-Weinberg equilibrium is reached after one generation of random mating in an indefinitely large population with discrete generations and, in the absence of mutation and selection, the genotype frequencies then remain constant across generations [71,72]. In the case of high heterogeneity in effect alleles between populations, it may take a number of generations for the allele distribution to homogenize, accompanied by an increase in disease prevalence, as described in Section 2.1. This effect is barely detectable for smaller risk allele differences, as modeled in Section 2.2.
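The Hardy-Weinberg statement above can be made concrete for a single edited SNP. The sketch assumes an idealized case in which the risk allele is corrected to the neutral state in a given fraction of individuals and the next generation is produced by random mating; the helper names and the starting frequency of 0.2 are hypothetical.

```python
def pooled_risk_allele_frequency(p_before, treated_fraction):
    """Risk-allele frequency in the gene pool after the allele is corrected
    to the neutral state in the given fraction of individuals."""
    return p_before * (1.0 - treated_fraction)

def hardy_weinberg(p):
    """Genotype frequencies for 0, 1 and 2 risk-allele copies after one
    generation of random mating."""
    q = 1.0 - p
    return q * q, 2.0 * p * q, p * p

p0 = 0.2                                           # illustrative starting frequency
p1 = pooled_risk_allele_frequency(p0, treated_fraction=0.5)
genotypes = hardy_weinberg(p1)
mean_copies = genotypes[1] + 2.0 * genotypes[2]    # expected risk alleles per person

print("allele frequency after therapy:", p1)
print("genotype frequencies (0, 1, 2 copies):", genotypes)
print("mean risk-allele copies per individual:", mean_copies)
```

After one round of random mating the edited and unedited gene pools are indistinguishable at this locus: the genotype frequencies depend only on the pooled allele frequency, not on which individuals were treated.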
Simulations confirmed that modifying or turning off a number of causal alleles in a higher-risk population can easily reduce the risk to that of a lower-risk population. Additionally, treating, for example, half of the individuals in a population with double the number of corrected SNPs (or any other proportion, as long as there are enough SNPs to correct) produces the same population risk load reduction, as the corrected SNPs would distribute within a few generations of random mating. Figure 4 demonstrates this by starting with a homogeneous population with identical risk in generation 0, subdividing individuals into two equally sized populations, and lowering the average RR of Population 1 by 10-fold (PRS = −2.3). The result is equivalent to those described for Population 1 and Population 2 in Section 2.2, as shown in Figure 3, and is followed by an identical admixture pattern. The variance of the combined population after admixture diminishes by 0.9%, reflecting the lower frequency of the risk alleles in the population.
Figure 4. Admixture of two populations following a gene therapy resulting in a 10-fold relative risk difference. The homogeneous population was divided into two equal-size populations. The initial disease prevalence was set at 1%, and the disease heritability was 50%. In Population 1, individuals' SNPs were uniformly edited to achieve a 10-fold improvement in relative risk (RR) (PRS = −2.30) in generation 1. Population 2 prevalence remained at the initial level in generation 1, while the prevalence of Population 1 decreased to 0.1%. After that, admixture patterns become mirror images of those in Figure 3. The plots in (A,B) show that the population mean polygenic risk score (PRS) equalizes between the two populations, depending on the admixture rate. The plots in (C,D) show a corresponding change in the mean disease prevalence.

Estimates of Population Genomic Parameters for Diseases Known to Have Large Risk Differences between Ethnic Groups
Many diseases differ in terms of their risk and prevalence among subpopulations. In reviewed published cases, admixed populations were shown to have intermediate liability. Examples include differences in nicotine metabolism between Maori and European populations [41], differences in type 2 diabetes (T2D) risk between European American and African American populations [73], and differences in atrial fibrillation risk among a variety of populations [74], with prevalence usually differing by less than 2-fold between affected populations.
Three examples of diseases with contrasting risk between populations, primarily of middle-age onset, are Dupuytren's disease (DD), rheumatoid arthritis (RA), and lupus erythematosus (LE). DD heritability was determined by Larsen et al. [75] as 80%. Its prevalence is extremely varied: at older ages, it affects 22-32% of men in populations originating from Northern European countries [57], whereas prevalence is significantly lower in populations of other origins, with the lowest values, in Korea [76] and in Taiwan and China [77], being 100-1000 times lower than in Northern European populations. According to Molokhia and McKeigue [78], West Africans have a higher risk of LE than Europeans, and Native Americans have a higher RA risk than Europeans. Both diseases also show intermediate risks in admixed populations. LE heritability is estimated to be 44% [59,79]; its prevalence was reported to be 0.35% for 60-year-old African American women and 0.1% for European American women [80]. RA heritability is estimated to be 60% [58]; it has a prevalence of 3% in Canadian Native Americans and 0.3% in Europeans [81].
The above three examples were specifically chosen because their maximum incidence rates occur in early to late-middle ages. Therefore, the prevalence of these diseases at moderately old ages approaches the disease lifetime risk. The admixture simulation results are presented in Table 2 and graphically illustrated in Figure A3.
Table 2. Results of admixture of two equal-size populations differing in the prevalence of early- to middle-age-onset diseases and the estimated SNP corrections required to achieve disease parity. Disease abbreviations: DD, Dupuytren's disease; RA, rheumatoid arthritis; LE, lupus erythematosus. Pop 1 has a higher disease prevalence, and Pop 2 has a lower disease prevalence. The term "Relative Risk" describes the number of times by which the prevalence differs between Pop 1 and Pop 2. The average SNP effect is expressed in units of natural log(RR), a combination of alleles with varying effects and frequencies, with an average RR value of 1.1 in this instance. As described in the Methods section, "SNPs in Disease Architecture" is the total number of SNPs in the genetic architecture responsible for disease heritability.
The last three columns in Table 2 show the differences in PRSs between populations (in units of log(RR)) and the average number of SNPs at the average genetic architecture effect size in need of correction to match the risk in high-risk populations with that in lower-risk populations if such a therapy were possible. It is shown that DD would require 89 SNPs to be corrected to reduce the high risk in North European ethnicities to match that in the Korean population. RA and LE would require significantly fewer edits. In each case, the number of edits constitutes only a small fraction of SNPs in each disease's common low-effect genetic architecture.
The values of the admixed prevalence of RA and LE closely follow the geometric mean of the initial populations, as established in Section 2.2. The simulation results noticeably deviate from the geometric mean in the case of DD, for which the geometric mean $\sqrt{0.25 \times 0.0025} = 0.025$ equals 2.5%, rather than the value of 4% found by the simulations. This indicates that 25% can hardly be considered a low prevalence from the perspective of relative risk, particularly when considering large risk differences between populations. Further simulation of scenarios with the more common, smaller differences in disease relative risk between populations showed that prevalences after admixture closely followed the geometric mean of the two initial populations; however, based on the assumption in Methods, the model is better confined to prevalences in single digits and below, typical of EMODs.

An Estimate of Preventive Gene Therapy for Early-to Middle-Age-Onset Polygenic Diseases
The review of the three diseases above (DD, RA, and LE) estimated the differences in the number of SNPs related to disease risks in naturally occurring populations and, accordingly, differences in the number of SNP corrections that would be required to achieve population parity for these EMODs.
Following the evaluation of population stratification by disease risk, admixture, and a simple correctional edit followed by population admixture in Sections 2.2 and 2.3, it is time to consider a scenario that could allow for broader extrapolations. There can be countless potential scenarios of therapy levels, stratification, and admixture. It can be hypothesized that there may be an optimal level of population EMOD risk that can be achieved by lowering the average population PRS or, equivalently, by lowering the true causal risk allele frequencies.
A scenario was chosen in which, for the individuals participating in gene therapy (Population 1), the required number of risk SNPs was therapeutically edited to lower the population relative risk by 10-fold, or by a PRS of β = −2.3, in the first generation of ongoing therapy, on the premise that a 10-fold risk reduction in any disease would be a commendable improvement. Subsequently, smaller therapeutic interventions were applied in each generation to maintain Population 1 at this optimal level; the number of edits per generation is shown in Figure A4.
The evaluation of the admixture scenarios for Population 2, which does not directly participate in gene therapy (see Figure 5), shows that, in the 100% admixture (blending) scenario, the disease prevalence in Population 2 plummets to 0.32% (or 32% of the prevalence baseline value) in the first generation, while the population PRS reaches the exact halfway point between values in the original populations.
However, unlike the admixture scenarios presented in Sections 2.2 and 2.3, the improvement continues to asymptotically progress toward the treated Population 1 level of 10% of the baseline disease prevalence. The PRS progression using Equation (1) would just require fixing $\beta_1(g) = \mathrm{const}$, the level of the chosen optimal treatment. From the perspective of the PRS admixture, this result is equivalent to the basic island-continent migration model; however, the disease prevalence connotations are noteworthy. Figure A5A shows the renormalization of the relative PRS that can be applied to estimates with any chosen initial values of relative risk improvement, and Figure A5B shows the normalized prevalence progression for the RR = 10 treatment level. For comparison, a therapy lowering the population relative risk 4-fold, depicted in Figure A5C, showed that the relative prevalence reduction for the non-participating populations with ongoing admixture, as compared to the treated population, would be similar for varying degrees of treatment.
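The island-continent analogy can be sketched by holding the treated population's PRS constant and iterating the mixing rule for the untreated population. The update rule β2(g+1) = β2(g) + 0.5·m·(β1 − β2(g)) and the low-prevalence approximation that relative prevalence scales as exp(PRS) are both assumptions made for illustration, and the function name is hypothetical; the study's simulated prevalences may differ because they also depend on the PRS variance.

```python
import math

def untreated_prs(beta_treated, beta_start, m, generations, share=0.5):
    """Mean PRS of the untreated population over generations when the treated
    population is held at a constant PRS (the 'continent')."""
    betas = [beta_start]
    for _ in range(generations):
        previous = betas[-1]
        betas.append(previous + share * m * (beta_treated - previous))
    return betas

beta_treated = -math.log(10.0)     # 10-fold lower relative risk (PRS of about -2.3)
betas = untreated_prs(beta_treated, beta_start=0.0, m=1.0, generations=50)

# Low-prevalence approximation: prevalence relative to baseline ~ exp(PRS)
relative_prevalence = [math.exp(b) for b in betas]

for g in (0, 1, 2, 3, 10):
    print(f"generation {g}: PRS = {betas[g]:.3f}, "
          f"relative prevalence = {relative_prevalence[g]:.3f}")
```

In the 100% blending case the untreated population reaches the halfway PRS after one generation, corresponding to about 32% of the baseline prevalence under this approximation, and then continues toward the treated level of 10% as each generation halves the remaining gap.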

Discussion
With the accumulation of scientific knowledge of the genomic causes of common diseases and the advancement of gene-editing technologies, gene therapies to prevent polygenic diseases may soon become a reality. GWAS research over the past decade has ascertained that polygenic EMODs and LODs share a genetic risk architecture: their causality is primarily attributable to common low-effect alleles [4,30] in multiplicative joint associations with environmental risk factors [33]. With the application of the multiplicative genetic risk model, the computer simulations developed in this research mapped the polygenic risk of the model genetic architecture of EMODs based on their prevalence and heritability into individual disease probability. The results of these simulations correlated well with epidemiological observations (see Appendix A.1). Simulations of the admixture between modeled populations using this framework were performed to investigate a hypothetically possible range of heterogeneity of causal SNPs in geographically distinct populations. Subsequently, these simulations were applied to model scenarios of gene therapies to assess the relationship between population admixture and disease prevalence throughout generations.
The simulations of admixture with differing causal SNPs between populations with identical disease prevalence demonstrated that, in principle, even a large degree of heterogeneity in causal allele sets for EMODs between populations would be difficult to detect. Whether all causal SNPs were identical or whether a large fraction of them differed between a pair of populations, the epidemiological and clinical statistics would be practically indistinguishable. Equally, it was shown that the outcomes of gene therapies would not be impeded under either situation. The commonality of causal gene variants for polygenic diseases between geographically distinct populations, as reported by GWASs [38,39,82] (with some models exploring a larger extent of allelic heterogeneity [83]), makes this extreme difference in causal allele sets unlikely, and the differences in disease prevalence and disease manifestation between populations appear to be primarily caused by differences in common allele frequencies.
The finely balanced risk of genetic architecture in this model scenario would be far exceeded by the actual risk differences in geographically distinct populations, which often differ in disease prevalence [84]. The simulated population admixture for all polygenic diseases with differing risks among populations resulted in arithmetic averaging of the PRS, expressed as the sum of logarithms of the causal alleles' true relative risk, and the prevalence of EMODs followed the geometric mean of the original populations.
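The link between arithmetic averaging of the PRS and the geometric mean of relative risk can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the example relative risks of 1 and 4 are assumptions, and it treats prevalence as proportional to relative risk, which is reasonable only for low-prevalence diseases.

```python
import math

def blended_relative_risk(rr_pop1, rr_pop2, share_pop1=0.5):
    """Average two populations on the log scale (PRS), then return to the RR scale."""
    prs_pop1 = math.log(rr_pop1)
    prs_pop2 = math.log(rr_pop2)
    prs_blend = share_pop1 * prs_pop1 + (1.0 - share_pop1) * prs_pop2
    return math.exp(prs_blend)

# Hypothetical example: populations with relative risks of 1 and 4, mixed equally.
rr_blend = blended_relative_risk(1.0, 4.0)
geometric_mean = math.sqrt(1.0 * 4.0)
```

Because the PRS is a sum of logarithms, its arithmetic mean corresponds to the geometric mean on the relative risk scale, here 2 rather than the arithmetic mean of 2.5.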
The extreme differences in common EMOD risk, exemplified by DD, LE, and RA, demonstrate the range of polygenic distribution differences that may develop between populations due to geographic separation that occurred within an evolutionarily short time. Furthermore, these differences indicate the potential to alleviate risks of these and other polygenic diseases using gene therapy. The simulation results for typical EMODs show that the disease prevalence decreases in proportion to the degree by which the treatment lowers the population average relative risk.
It is hard to imagine that, even if such gene therapies were available, everyone would participate. In the hypothetical scenarios in which populations admix at a low rate of 10% (which would not be typical, particularly in the Americas [84]), the prevalence rates of the targeted diseases in the fraction of the population not directly receiving gene therapy would noticeably decrease in the second generation and even more so in subsequent generations. Longer term, this admixture would lead to a lower and more equal disease risk for all populations. A hypothetical example of such group stratification with regard to preventive gene therapy is preventive genetic treatment during in vitro fertilization (IVF), which could be legislatively limited only to situations in which the parents were found to possess high PRSs of a polygenic disease [14]. In the first generation, only the direct recipients would benefit, but normal admixture over the scale of generations would cause the whole population's disease prevalence to diminish, as the simulations in this research demonstrate.
Again, hypothetically, even if gene therapy were to be discontinued after significantly reducing the risk of Mendelian diseases and EMODs over time, the low human germline mutation rate (estimated to be on average $1.18 \times 10^{-8}$ mutations per nucleotide per generation, which corresponds to 44-82 mutations per individual genome with an average of only one or two mutations affecting the exome [85]) means that many generations would pass before the disease rates would significantly increase again [86][87][88].
A complete picture of polygenic disease prevention must include LODs. The analysis method applied to EMODs would not be valid for polygenic LODs because LODs typically manifest with extremely low incidences of diagnosis at younger ages, followed by a period of a nearly exponential annual increase in the disease incidence rate starting at relatively older and LOD-specific ages [49]. According to Chatterjee et al. [33], the conditional age-specific incidence rate of the disease can be modeled using Cox's proportional hazards model [50] and multiplicative joint associations between low-penetrant SNPs and environmental risk factors [33]. An evaluation using this model [51] showed that a moderate level of therapy that lowered the hazard ratio by 4-fold (OR = 0.25) by converting detrimental SNPs to a neutral state would result in lifetime risk reduction by 30-54% for AD, T2D, CAD, and stroke, and 59-73% improvement for the analyzed four cancers, as long as mortality from all causes remained constant. With increasing longevity, this corresponded to a delayed onset of LODs, with a delay of about three years for AD; between 10-15 years for T2D, cerebral stroke, and coronary artery disease (CAD); and an even longer onset delay for breast, prostate, colorectal, and lung cancers.
A recent clinical and GWAS analysis by Mars et al. [52] determined that the difference in age at disease onset between the top and bottom 2.5% fraction of PRSs was 6-13 years for four LODs that overlapped with Oliynyk [51]. A lower onset difference value was found to be characteristic of T2D and CAD, while breast and prostate cancers showed the highest differences in terms of age of onset, thus clinically confirming the patterns predicted by simulations in [51]. The naturally occurring difference in the age of onset for the top and bottom fractions of the natural PRS variation [52], in principle, shows that applying gene therapy that would turn a sufficient number of true causal SNPs into neutral SNPs, thus turning the high risk population into the low risk population, would have the predicted outcome reflected in years of a delayed LOD onset.
The current research confirms that, for polygenic diseases, including LODs, if gene therapy were to lower the frequency of true causal risk alleles and the corresponding population PRS, these proportions would propagate throughout subsequent generations [72]. In the case of admixture with populations not directly participating in gene therapy, the PRS would distribute proportionately to population mixing ratios, which for LODs will be reflected in disease onset delay [51] for all beneficiary generations. The incidence of EMODs does not strictly stop at a particular age; rather, a later but lower disease incidence occurs for all EMODs referenced herein. Therefore, preventive genetic treatment of these conditions may to a degree result in a delay of disease onsets.

Methods
This study assessed population genetics dynamics for a hypothetical future in which gene therapy can be applied to prevent polygenic diseases. In earlier research, the risk allele distribution for polygenic LODs that accompanies aging was evaluated [49], and the potential of future preventive gene therapy to delay onset ages and lower the lifetime risk of developing such LODs was successfully quantified [51], as demonstrated in Figure 1B, by leveraging age-specific incidence rates under multiplicative [33] Cox's proportional hazards model [50]. The findings of this earlier publication complement the results of the current research and are noted in the Discussion.
The main goal of this study was to quantify the impact of gene therapy from a population genetics perspective while accounting for population stratification and admixture. Gene therapy corrections that change the frequency of detrimental SNPs within a subset of a population will reach population-wide Hardy-Weinberg equilibrium after one generation of random mating in an indefinitely large population with discrete generations, in the absence of mutation and selection, and the frequency of genotypes will then remain constant throughout generations [71,72]. This equally applies to polygenic phenotypes [89], and the extended diploid Wright-Fisher model simulation reproduced this expected behavior, thus validating that the model's granularity on a generational scale was appropriate for the intended target of this research. Although the mean population PRS found in this study precisely follows the Hardy-Weinberg principle, the behavior of disease risk variance in the polygenic admixture is more gradual as a result of linkage disequilibrium and recombination [70,90,91].
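The single-generation convergence to Hardy-Weinberg proportions can be illustrated for one biallelic SNP. This is a minimal sketch under the idealized assumptions stated above (random mating, very large population, no mutation or selection). The allele frequencies and the treated share are hypothetical values chosen for the example.

```python
def hardy_weinberg(risk_allele_freq):
    """Genotype frequencies after one generation of random mating."""
    p = risk_allele_freq
    q = 1.0 - p
    return {"risk/risk": p * p, "risk/neutral": 2.0 * p * q, "neutral/neutral": q * q}

def risk_allele_frequency(genotypes):
    """Recover the allele frequency from genotype frequencies."""
    return genotypes["risk/risk"] + 0.5 * genotypes["risk/neutral"]

# Hypothetical example: therapy lowers the risk-allele frequency from 0.30 to 0.10
# in a quarter of the population; the rest is untreated.
treated_share, p_treated, p_untreated = 0.25, 0.10, 0.30
p_pooled = treated_share * p_treated + (1.0 - treated_share) * p_untreated

generation_1 = hardy_weinberg(p_pooled)
generation_2 = hardy_weinberg(risk_allele_frequency(generation_1))
```

The pooled allele frequency is preserved from one generation to the next, so `generation_2` repeats `generation_1`: the population-wide benefit of the edit persists without further intervention.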
The following sections review the simulation's conceptual foundations and conclude by describing the simulation steps.

Considerations for Liability Threshold Models
Of the polygenic diseases analyzed in this research, those with the highest incidence in early and middle age are less suitable for the age-specific rates approach used earlier for LODs [51] because subjects with an earlier age at onset do not necessarily show an increased polygenic risk burden, as exemplified by the incidence of schizophrenia [64]. The prevalence of these diseases is sometimes modeled using the liability threshold model, originally proposed in [65,66]. Under this model, illustrated in Figure 1C,D, the disease prevalence is a function of disease liability, which is represented by polygenic risk. In the liability threshold model, an individual can be characterized by a genetic liability to a disease. A combination of genetic and environmental effects results in a probabilistic disease distribution among individuals. In the original Falconer [65] interpretation, all individuals whose PRS exceeds the threshold contribute to the disease prevalence; graphically, these individuals fall to the right of the threshold. Subsequent research has shown that the multiplicative risk model is most suitable for explaining experimental data. This model is exemplified by three approaches: the Risch risk model, the odds risk model, and the probit risk model [67,92,93]. The solutions based on these models are typically obtained through simulations or numerical methods, with the exception of the simplest scenarios that allow for analytic solutions, providing estimates of disease prevalence according to the polygenic risk distribution. These models lack the ability to sample individuals in the multi-generation population simulations required in this study, and they are also based on specific allele distributions that will not be maintained during ongoing admixture and gene therapy. Hence, this study developed the simulation approach described in Section 4.4, applying probabilistic sampling of individuals by PRS validated in [51].

Conceptual Summary
The simulated diseases were assumed to have an early- to middle-age onset, with a negligible disease incidence at older ages. The term "prevalence" is customarily used in liability threshold models, but it is often not well defined whether the term pertains to a whole population or to a population of a certain age range. Herein, the term is used in a narrower scope: "prevalence" means the cumulative incidence of a disease at an age later than the typical onset age range, with negligible incidence later on. Thus, the definition of prevalence in this context is more similar to the lifetime risk concept.
The heritability of EMODs usually ranges from 30% to 80%, as documented by Wang et al. [68] and Polubriaginof et al. [69]. A heritability level of 50% was chosen for most simulations and analyses to represent a typical EMOD, and the common low-effect-size genetic architecture SNP set was assembled accordingly, as noted in Section 4.3. The analysis of specific EMODs used their heritabilities.
Large population sizes were used to make genetic drift effects imperceptible at the short generational scale used in the simulations. Similarly, although the simulation design allowed for the introduction of mutations, given the short generational scale under consideration, mutations could not achieve common population frequency [86][87][88] and were not introduced.
This study was not concerned with evaluating potential obstacles due to pleiotropy, which, in the context of gene therapy, is defined as the possible negative effects on other phenotypic features resulting from an attempt to prevent an EMOD by modifying a subset of SNPs [94,95]. Under the common low-effect genetic architecture used in the simulations, from an average of 514 such SNPs in the average modeled individual (as shown in Figure A1A), gene therapies would only need to correct an average of 15 SNPs to achieve a 4-fold decrease in the relative risk (PRS = −1.386) and 24 SNPs to achieve a 10-fold RR decrease (PRS = −2.30). Arguably, with personalized prophylactic treatment, it would be possible to select a small fraction of variants from a large set of available choices, as exemplified in Table 2, that do not possess antagonistic pleiotropy, or perhaps even to select SNPs that are agonistically pleiotropic with regard to some of the other EMODs and LODs. After all, because of a balance between selection, mutation, and genetic drift on evolutionary scales [87], a proportion of low-effect detrimental SNPs have achieved common population frequency simply because they were not detrimental enough to have been selected out, rather than having been selected for because they provide a physiological or survival benefit. Thus, these SNPs would constitute an uncontroversial therapeutic target.
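The PRS values quoted above follow directly from the definition of the PRS as the natural logarithm of relative risk. The short sketch below reproduces them and, as a rough illustration only, derives the average per-allele relative risk that the quoted SNP counts would imply if every corrected allele had the same effect. That equal-effect simplification and the function names are assumptions and are not part of the modeled architecture.

```python
import math

def prs_change_for_fold_reduction(fold):
    """PRS shift (natural log of RR) for an n-fold decrease in relative risk."""
    return -math.log(fold)

def implied_mean_allele_rr(fold, n_corrected_snps):
    """Average per-allele RR if all corrected alleles had equal effect (a simplification)."""
    return math.exp(math.log(fold) / n_corrected_snps)

prs_4_fold = prs_change_for_fold_reduction(4)     # about -1.386
prs_10_fold = prs_change_for_fold_reduction(10)   # about -2.30

mean_rr_4_fold = implied_mean_allele_rr(4, 15)
mean_rr_10_fold = implied_mean_allele_rr(10, 24)
```

Under the equal-effect simplification, both scenarios imply an average per-allele relative risk of roughly 1.1, which is in line with a common low-effect architecture.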
In the simulations, the F-statistic (Fst) for disease architecture alleles was calculated using Hudson's method, as recommended by Bhatia et al. [96], and the alternative allele frequency difference (AFD) statistics were also calculated [97]. The statistics obtained were unsurprising for the simulated populational processes, and including their interpretation in the reported results would be extraneous. Nevertheless, for those interested, these results are available in Supplementary Data. While admixture naturally involves multiple world populations, simulating the admixture of two populations was adequate for the intended analysis and extrapolations.
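For readers who wish to recompute these statistics from allele frequencies, the following sketch shows one way to do so. It uses the Hudson estimator in the ratio-of-averages form described by Bhatia et al. and the absolute allele frequency difference for biallelic SNPs. The function names, the sample sizes (counted in chromosomes), and the example frequencies are assumptions for illustration and are not taken from the simulation code.

```python
def hudson_fst(freqs_pop1, freqs_pop2, n_chrom1, n_chrom2):
    """Hudson's Fst across SNPs, combined as a ratio of summed numerators and denominators."""
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        # Squared frequency difference, corrected for finite sample sizes.
        numerator += ((p1 - p2) ** 2
                      - p1 * (1.0 - p1) / (n_chrom1 - 1)
                      - p2 * (1.0 - p2) / (n_chrom2 - 1))
        denominator += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    return numerator / denominator

def mean_afd(freqs_pop1, freqs_pop2):
    """Mean absolute allele frequency difference for biallelic SNPs."""
    differences = [abs(p1 - p2) for p1, p2 in zip(freqs_pop1, freqs_pop2)]
    return sum(differences) / len(differences)

# Hypothetical allele frequencies for three causal SNPs in two populations.
pop1 = [0.20, 0.50, 0.35]
pop2 = [0.40, 0.10, 0.35]
fst = hudson_fst(pop1, pop2, n_chrom1=20000, n_chrom2=20000)
afd = mean_afd(pop1, pop2)
```

Summing numerators and denominators before dividing, rather than averaging per-SNP ratios, follows the recommendation of Bhatia et al. for combining estimates across SNPs.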
The analysis in this study is contingent on future genetic and computational techniques being capable of determining and safely modifying a relatively small subset of disease genetic architecture SNPs from a detrimental state to a neutral one. This is easy to accomplish in a population simulation, in which the effect sizes and states of detrimental SNPs are known for each individual. These model genetic architecture SNPs are treated as variants that are truly causal for disease liability and heritability. A brief summary of current gene-editing technologies is included in Appendix A.2.

Allele Genetic Architecture
The common low-effect-allele architecture was implemented in a similar manner to that used in the author's earlier research [49], which followed the approach used in [17]. The summary, including specifics of the implementation in this study, is available in Appendix A.3. In contrast to GWAS tag SNPs, the model genetic architecture SNPs are variants that are truly causal for disease liability and heritability, and they are assumed to be accurately identified for the purposes of personalized gene therapy. Estimates using the liability threshold model customarily use RR values to model known causal SNPs [67,98]. This research followed suit: SNP effects were treated in terms of relative risk, and PRSs were expressed in terms of the sum of the logarithm of RR. This method is also justified by the fact that the majority of EMODs have a prevalence of less than 2%, as exemplified by RA [58], LE [59], schizophrenia and bipolar disorder [60,99,100], and Crohn's disease [61,62], with only a small number of diseases such as asthma [53] approaching a prevalence of 10% [68]. Dupuytren's disease, which has a prevalence of more than 30% in some Northern European ethnicities, although it is lower in most of the world by 1-3 orders of magnitude, is an interesting example that was examined in this research. The alleles were randomly distributed throughout the model genome, which is consistent with GWAS findings for asthma [53,101], schizophrenia [102], and other diseases [4].

Disease Prevalence Analysis
In order to track the changes in disease prevalence associated with population admixture and gene therapy, it was necessary to map PRSs to the probabilities of succumbing to a polygenic disease on the basis of the genetic architecture and disease prevalence. Individual RRs $R_i$ were calculated as a product of the RRs of all SNPs in the disease genetic architecture, as follows:

$$R_i = \prod_k r_k^{a_k^i},$$

where $r_k$ is the kth SNP's true RR, and $a_k^i$ (equal to 0, 1, or 2) is the number of the kth allele in a pair of individual chromosomes $i$. The PRS $\beta_i = \log(R_i)$ is defined in Appendix A.3. Multiplicativity by RR is equivalent to additivity by PRS. The simulations sampled individuals from the allocated population without replacement, proportionate to individual RR $R_i$, until the sample of $n$ individuals (those diagnosed with the disease) reached the number that satisfied the disease prevalence $K$:

$$K = \frac{n}{N},$$

where $N$ is the population size. The goal was to map an individual's PRS to the probability of that individual becoming ill on the basis of disease prevalence and PRS distribution, dictated by heritability and allele genetic architecture. In practice, the simulation loop sorted the sampled diagnosed individuals into narrow PRS intervals, from $\beta$ to $\beta + \Delta\beta$, and determined the probability $\pi$ of each PRS band from $i_\beta$ to $i_{\beta+\Delta\beta}$, the numbers of individuals sorted by PRS in a PRS band, and the population size $N$.
Thus, under the multiplicative risk model, an individual's probability of being diagnosed with the disease under consideration can be mapped to the individual PRS, and this mapping can be used in subsequent generations in conjunction with gene therapy and population admixture. The advantage of this approach is that once the mapping is determined, it can be saved and reused in subsequent simulation runs as long as the chosen initial genetic architecture and prevalence are identical. This initial mapping was made very accurate by building large sets of individual PRSs per run of determination simulation (a set of eight billion was typically used) and averaging the mapping over multiple runs. The resulting mapping distribution is shown in Figure 6.

Figure 6. Disease probability distribution mapped to individual PRS. The corresponding vertical lines mark the mean PRS of the diagnosed population. In simulations for a population with a mean PRS normalized to zero and a heritability of 50%, the PRS probability of disease curves reproduced the liability threshold model's logistic distribution of probabilities [103]. This PRS probability distribution allows for the precise reproduction of the original disease prevalence and is used to determine changes in prevalence that result from simulated population admixture and gene therapy. The mean PRS of a diagnosed population and the probability curve move toward lower PRS values with increasing prevalence, as also illustrated in [67].
The application of this mapping, using identical PRS bands, to the initial population reproduced the original prevalence with high precision and obtained a deviation of less than 2% in a two-sigma (95%) confidence interval for the PRS and prevalence results. Thus, error bars in the graphs would be extraneous. An exception is the population admixture figures in which a small relative change in values necessitated the inclusion of the two-sigma error bars (for example, in Figure 2C).
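The sampling and banding procedure described in this section can be sketched at toy scale as follows. This is not the simulation code used in the study. The population size, SNP count, allele frequency, per-allele RR, and band width are hypothetical. Sampling without replacement in proportion to RR is implemented here with exponential arrival times, a substituted but equivalent technique. The band probability is taken to be the fraction of individuals in a band who were sampled as diagnosed, which is an assumption about the exact definition.

```python
import math
import random

def prs_probability_bands(n_individuals=5000, n_snps=100, allele_freq=0.3,
                          snp_rr=1.1, prevalence=0.02, band_width=0.25, seed=1):
    """Return (band start, individuals in band, cases in band, probability) per PRS band."""
    rng = random.Random(seed)
    log_rr = math.log(snp_rr)

    # PRS = (number of risk alleles carried) * log(RR); a diploid genome has 2 * n_snps sites.
    prs = []
    for _ in range(n_individuals):
        risk_alleles = sum(rng.random() < allele_freq for _ in range(2 * n_snps))
        prs.append(risk_alleles * log_rr)
    mean_prs = sum(prs) / n_individuals
    prs = [value - mean_prs for value in prs]          # normalize the mean PRS to zero

    # Sample cases without replacement, proportional to RR = exp(PRS):
    # each individual gets an exponential waiting time with rate RR, and the
    # n_cases shortest waiting times are the diagnosed individuals.
    n_cases = round(prevalence * n_individuals)
    arrivals = sorted((rng.expovariate(math.exp(value)), i) for i, value in enumerate(prs))
    cases = {i for _, i in arrivals[:n_cases]}

    # Sort individuals into PRS bands and count diagnosed individuals per band.
    in_band, cases_in_band = {}, {}
    for i, value in enumerate(prs):
        band = math.floor(value / band_width)
        in_band[band] = in_band.get(band, 0) + 1
        cases_in_band[band] = cases_in_band.get(band, 0) + (i in cases)
    return [(band * band_width, in_band[band], cases_in_band[band],
             cases_in_band[band] / in_band[band]) for band in sorted(in_band)]

bands = prs_probability_bands()
```

Once computed, such a table can be reused to estimate prevalence for a later generation by multiplying each band's probability by the number of individuals whose PRS falls in that band and summing over bands.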

Simulating Gene Therapy under Population Stratification and Admixture Scenarios
The following simulation steps were performed.
(1) Simulation initialization. The simulation initialization steps were performed, including the allocation of population objects and the assignment of individual PRSs on the basis of the modeled genetic architecture allele frequencies chosen for each population. Individuals were subdivided into two populations, Populations 1 and 2, with equal relative sizes and male/female proportions (configurable in the simulation setup). The initial disease prevalence and genetic architecture effect size in Population 1 were always used as references for Population 2 and the combined population. When gene therapy was performed, it was always applied to Population 1. For the validation of extreme population stratification and admixture scenarios, four sets of genetic architectures were constructed and specified in the simulation configuration. The population differences were set to 100% (all causal SNPs differ between population genetic architectures), 66%, 33%, and 20% (i.e., one-fifth of the causal SNPs differ). The difference was estimated by the fraction of the PRS difference that was attributed to differing SNP architectures between the two populations.
(2) Reproduction. The simulation proceeded through successive generations via reproduction with the configured level of population admixture. The admixture was configurable in a range from 100% to 0%. A rate of 100% meant that members of the opposite populations reproduced exclusively with each other (also referred to as "blending", where either population contributes exactly half of the diploid genome to each offspring in a generation). Above 50%, reproduction is preferentially between opposite populations. A rate of 50% means that reproduction within the same population and between opposite populations is equally probable. Values lower than 50% favor reproduction within the same population; for example, an admixture level of 10% means that the probability of individuals reproducing within their own population is 90%, and the chance of admixture with the other population is 10%. The offspring of the opposite populations had an equal chance to belong to either population, and the offspring from reproduction within the same population remained in their parents' population.
(3) Recombination. Once the parental pairs were chosen in the preceding step, each parent's genome proceeded through recombination. The reported results used an average of 36 Poisson-distributed recombinations per parent in a single linear genome (configurable), and accelerated recombination with an average of 1000 Poisson-distributed crossovers was used to validate population admixture with a high level of difference in disease genetic architectures between populations.
(4) Gene Therapy. The gene therapy step consisted of sampling risk alleles for each individual chosen as a subject for gene therapy. The requisite number of risk alleles were turned off in order to achieve the chosen PRS improvement. As expected, the population average PRS reached equilibrium during the generation of random mating. The same PRS improvement was achieved by applying the same level of cumulative therapy to the highest-risk individuals or by averaging it over the population or any other population subset. Of the available simulation options, two were found to be the most illuminating: (a) therapy in a single generation of Population 1, followed by a varying degree of admixture with Population 2, and (b) the continuous maintenance of a chosen optimal population health improvement (PRS level) in Population 1, accompanied by varying levels of admixture with Population 2. Gene therapy included the ability to define the set of SNPs to be edited. This was carried out by specifying the desired SNPs in a configuration file, which was valuable for validating the results shown in Section 2.2.
(5) Analysis. The individual risk alleles in each individual were accounted for at a number of stages in the simulation process and aggregated into the population PRS distribution, prevalence analysis, and Fst and AFD statistics, which were saved in comma-separated values format for further analysis and reporting.
(6) Repeat. Steps (2)-(5) were repeated until the defined generation limit was reached. The simulation flow configuration included the option of re-running the same simulations multiple times. This allowed the results of multiple simulation runs to be averaged and the resulting multi-run variance and standard deviation for key statistics to be determined.
The simulation configuration screen, which references the described and additional options, can be seen in Figure A6.
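Steps (2) to (4) above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the simulation code used in the study. All function names are hypothetical, the genome is a unit-length list of loci, crossover positions are generated from exponential gaps so that their count is Poisson-distributed, and the therapy step picks risk-allele copies at random until the requested PRS reduction is reached.

```python
import math
import random

def partner_population(own, admixture_rate, rng):
    """Step (2): the partner comes from the other population with probability admixture_rate."""
    other = 2 if own == 1 else 1
    return other if rng.random() < admixture_rate else own

def offspring_population(pop_a, pop_b, rng):
    """Step (2): same-population offspring stay; mixed offspring join either population."""
    return pop_a if pop_a == pop_b else rng.choice((1, 2))

def crossover_points(mean_crossovers, rng):
    """Step (3): crossover positions on a unit-length genome (Poisson-distributed count)."""
    points, position = [], rng.expovariate(mean_crossovers)
    while position < 1.0:
        points.append(position)
        position += rng.expovariate(mean_crossovers)
    return points

def make_gamete(hap_a, hap_b, mean_crossovers, rng):
    """Step (3): copy one parental haplotype, switching to the other at each crossover."""
    points = crossover_points(mean_crossovers, rng)
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete, next_point = [], 0
    for locus in range(len(hap_a)):
        position = locus / len(hap_a)
        while next_point < len(points) and points[next_point] <= position:
            current, other = other, current
            next_point += 1
        gamete.append(current[locus])
    return gamete

def apply_gene_therapy(allele_counts, log_rr, target_prs_reduction, rng):
    """Step (4): turn off randomly chosen risk-allele copies until the PRS drops by the target."""
    edited = list(allele_counts)
    copies = [k for k, count in enumerate(edited) for _ in range(count)]
    rng.shuffle(copies)
    achieved, n_edits = 0.0, 0
    for k in copies:
        if achieved >= target_prs_reduction:
            break
        edited[k] -= 1
        achieved += log_rr[k]
        n_edits += 1
    return edited, achieved, n_edits

rng = random.Random(7)
gamete = make_gamete([0] * 1000, [1] * 1000, mean_crossovers=36, rng=rng)
counts = [1] * 100                     # hypothetical: one risk allele at each of 100 SNPs
effects = [math.log(1.1)] * 100        # hypothetical: RR = 1.1 per risk allele
edited, achieved, n_edits = apply_gene_therapy(counts, effects, math.log(4), rng)
```

Because the PRS is additive, the therapy step only needs a running total of the log relative risks of the alleles it switches off. A configured list of SNPs to edit, as mentioned in step (4), could replace the random shuffle.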

Conclusions
The simulations in this research demonstrated that, even if relatively large heterogeneity in the causal allele set for EMODs existed between populations, it would not be easily detectable by epidemiological studies in admixed populations. While the simulation results show that a large heterogeneity would be hypothetically possible, GWAS findings indicate the existence of a discernible commonality of causal SNPs for polygenic diseases between geographically distinct populations, and the extent of the risk differences between populations due to unique causal SNPs is likely not extreme. Even if it were large, this potential difference would not impede the outcomes of preventive gene therapies if they were applied to turn population-specific true causal SNPs into a naturally existing neutral state of nucleotides, and this would hold after populations admix.
Preventive gene therapy that is designed to turn true causal SNPs into a naturally existing neutral state of nucleotides would result in a decrease in EMOD prevalence proportionate to the decrease in the population relative risk attributed to the edited SNPs. The outcome will manifest differently for LODs, where the therapies would result in a delay in the disease onset and decrease in lifetime risk; however, the lifetime risk would increase with prolonged life expectancy, a likely consequence of such therapies. EMODs exhibit some degree of incidence later in life, and, hypothetically, some of the outcomes may share characteristics with LODs.
In summary, the results of this study show that, if the preventive heritable gene therapies were to be applied on a large scale, even with a fraction of the population participating, the decreasing frequency of risk alleles in the population would lower disease risks or delay the ages of disease onset. With ongoing population admixture, all groups would benefit throughout successive generations.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Geographic and local population genetic stratification and variation complicate the ability to diagnose and treat a number of medical conditions [34]. It is well known that people from different geographic origins may have different rates of specific diseases, physiological responses to medications, and, as a result, different medical treatment outcomes. For example, in the US, the prevalence of type 2 diabetes is 12.8% in African Americans, 8.4% in Mexican Americans, and 6.6% in non-Hispanic whites [40]. Belbin et al. [42] investigated the difference in allele frequencies among individuals in Latin American populations and found that, although they were ostensibly derived from the same population, the top and bottom quartiles of the dominant ancestral component in admixed populations had larger changes in allele frequencies, with 20.4% of sites exhibiting a difference in frequency of >10% in individuals in the upper and lower quartiles with European ancestry in Puerto Rico. For individuals with Native American ancestry in a Mexican population, 36.0% of sites differed by >10%. This characteristic is shared by all groups that have undergone recent admixture and has been magnified by the multi-continental ancestry and local differentiation that underlie the genetic history of Latino populations [42]. A study that reviewed the US Veteran Affairs database [74] found age-adjusted prevalence values of atrial fibrillation (AF) of 5.7% in European Americans, 3.4% in African Americans, 3.0% in Hispanics, 5.4% in Native Americans/Alaskans, 3.6% in Asians, and 5.2% in Pacific Islanders. The differences in prevalence were accompanied by differences in AF symptoms, management, response to anticoagulants, and outcomes for these populations [104]. Another example is coronary artery disease genetic risk, which also varies in prevalence among populations [105].
In addition, breast cancer incidence is higher for Puerto Ricans and Cuban Latinas than for those from Mexico [106], and there are statistical differences by national origin in the rates of prostate, colorectal, lung, and liver cancers [106].
The predictive ability of GWAS and GWAS PRSs also varies broadly if the score is being applied to a population other than the one for which the score was initially determined [36]. For example, Holley et al. [107] observed significant differences in the distribution of SNPs associated with disease risk in New Zealand Maori patients with myocardial infarction compared with those of European origin. The authors concluded that although the genetic risk score (GRS) is overall higher for Maori when applying existing GRS tools, careful evaluation is needed before internationally developed GRS tools can be applied. Africa's haplotype diversity, which is the highest on Earth, has important implications for the design of large-scale medical genomics studies across the continent [108]. Investigations by local research institutions, given their rich local clinical data and case-control base, could help to bridge the existing knowledge gap and provide valuable nuanced genomic information for these communities and their descendants, including those who have emigrated to other regions of the world.
At the same time, there are indications of commonality of gene variants that are causal for polygenic diseases among geographically distinct populations. A study by Seyerle et al. [38], which was performed for five geographically distinct populations, found that, of 21 SNPs implicated as genetic determinants in QT-interval prolongation, seven showed a consistent direction of effect in all populations, and nine showed a consistent effect for four populations and typically small opposite effects for the remaining population. The effect allele frequency (EAF) varied among these populations. A GWAS on 28 diseases in Europeans and East Asians was conducted by Marigorta and Navarro [39], who reported high trans-ethnic replicability, implying common causal variants. Admixed populations usually show an intermediate level of liability or effect. For example, in individuals of Maori descent, nicotine metabolism is 35% lower than that in Europeans, with the metabolism of admixed individuals fitting between those of the two populations [41].
A simulation study by Zanetti and Weale [43] found that a combination of Euro-centric SNP selection and between-population differences in linkage disequilibrium and EAF was sufficient to explain the rate of previously reported trans-ethnic differences, without the need to assume between-population differences in the true causal SNP effect size. These findings suggest that the cross-population consistency found in this study is larger than that usually reported. Martin et al. [45] stated that, contrary to the belief that the polygenic scores of diverse populations are doomed to produce low PRS predictive power, diverse cohorts, rather than homogeneous cohorts, should be used. The authors further claimed that the effect size estimates from diverse cohorts are typically more precise than those from single-ancestry cohorts, and the resolution of causal variant fine-mapping can be considerably improved.

Appendix A.2. A Concise Summary of Gene-Editing Techniques
This study analyzed simulated populational outcomes of a hypothetical future gene-editing therapy for the prophylaxis of polygenic heritable diseases. Many ethical and regulatory considerations will need to be settled before such therapies become practicable [15,109]. Deeper scientific knowledge and more advanced techniques are being developed, particularly for personalized determination (with either computational methods or thoroughly verified genomic databases) of the deleterious effects of common [110][111][112] and rare allele variations and exome mutations [46][113][114][115][116][117][118]. It may be many years (likely decades) until precise knowledge is of sufficient depth for personalized medicine diagnostics to be conducted.
Gene-editing techniques need to perfect the ability to precisely modify genomic sequences with minimal off-target defects and to develop robust quality control measures for the results of editing. The most promising current technology is clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 [119], a rapidly developing technology, which replaced older technologies such as zinc-finger nuclease (ZFN) [120] and transcription activator-like effector nuclease (TALEN) [121]. In 2019 alone, reports were published on the improved specificity of the CRISPR operation [122], the modification of thousands of nucleotides while reducing DNA nicking [123], and the use of CRISPR-associated transposons to insert custom genes into DNA without cutting it [124], among many other developments. Synthetic genomics, which is mostly in the proof-of-concept stage [125,126], could be another promising future technology. Continuous improvement is required for the technology to reach sufficient specificity, access all areas of the genome, and achieve a sufficiently low number of off-target edits and defects. Only then will it be well suited for routine gene-editing therapeutic use.

Appendix A.3. Implementation of Common Low-Effect Genetic Architecture
The genetic disease architectures used in this research were based on [17], a simulation study that determined the number of alleles needed to achieve a statistical distribution variance that corresponds to the heritability of a particular polygenic disease or phenotypic feature. The allele architecture scenarios were implemented in the simulations in an identical manner to that applied in earlier research, in which five genetic architectures were validated and the common low-effect-size genetic architecture was determined to best fit the observed experimental and clinical data (see [49] for a comprehensive description). The common low-effect-size genetic architecture was used throughout this study. A concise summary of its major concepts, reformulated in terms of allele relative risks, and of the implementation steps that differed in this research follows.
The resulting variance of the allele distribution was determined to be
$$\mathrm{var} = 2 \sum_{k} p_k (1 - p_k) (\log r_k)^2,$$
where $p_k$ is the frequency of the kth genotype, and $r_k$ is the relative risk of any additional liability presented by the kth allele for a particular individual. The contribution of genetic variance to the risk can be expressed as the disease heritability:
$$h^2 = \frac{\mathrm{var}}{\mathrm{var} + \pi^2/3},$$
where $\pi^2/3$ is the variance of the standard logistic distribution [127].
Following [17], the variants were assigned to individuals with frequencies proportionate to the minor allele frequency (MAF) $p_k$ for SNP $k$, producing, in accordance with the Hardy-Weinberg principle, three genotypes (AA, AB, or BB) for each SNP with frequencies of $p_k^2$, $2p_k(1 - p_k)$, and $(1 - p_k)^2$. The diploid Wright-Fisher model simulation with recombination requires the tracking of SNPs on two chromosomes, and the individual PRSs $\beta_{ind}$ for $k$ SNPs were calculated as follows:
$$\beta_{ind} = \sum_{k} \sum_{c=1}^{2} a_k^c \log r_k,$$
where $a_k^c$ (0 or 1) is the state of the kth SNP on chromosome $c$ (1 or 2), and $r_k$ is the relative risk of the additional liability presented by the kth allele for a particular individual. The population mean PRS value $\beta_{mean}$ was calculated from the genetic architecture distribution using the following equation:
$$\beta_{mean} = \sum_{k} \left[ 2 p_k^2 + 2 p_k (1 - p_k) \right] \log r_k = 2 \sum_{k} p_k \log r_k.$$
Two populations were always used in the simulations, and the $\beta_{mean}$ value of the first (reference) population was applied to both populations, making it easy to compare the populations' PRSs, as well as the distribution of higher- and lower-risk individuals within and between the populations.
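The PRS bookkeeping described above can be illustrated with a minimal Python sketch. It is not the simulation code used in this study: the three-SNP architecture, the function names, and the random seed are hypothetical, and recombination and Wright-Fisher generations are omitted. The sketch only assumes that each allele state on each chromosome is drawn independently with probability equal to the MAF (which yields Hardy-Weinberg genotype proportions) and that the PRS is the sum of log relative risks over the risk alleles carried.

```python
import math
import random

def beta_mean(mafs, rrs):
    """Expected PRS under Hardy-Weinberg proportions: 2 * sum_k p_k * log(r_k)."""
    return 2.0 * sum(p * math.log(r) for p, r in zip(mafs, rrs))

def sample_individual(mafs, rng):
    """Draw two haplotypes; allele state a_k^c is 1 with probability p_k, else 0."""
    return [[1 if rng.random() < p else 0 for p in mafs] for _ in range(2)]

def beta_ind(haplotypes, rrs):
    """Individual PRS: sum over chromosomes c and SNPs k of a_k^c * log(r_k)."""
    return sum(a * math.log(r) for hap in haplotypes for a, r in zip(hap, rrs))

# Hypothetical toy architecture, for illustration only.
mafs = [0.1, 0.3, 0.5]
rrs = [1.15, 1.10, 1.05]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_individuals = 20000
scores = [beta_ind(sample_individual(mafs, rng), rrs) for _ in range(n_individuals)]

simulated_mean = sum(scores) / n_individuals
expected_mean = beta_mean(mafs, rrs)
print(f"simulated mean PRS = {simulated_mean:.4f}, analytical mean = {expected_mean:.4f}")
```

With a sample of this size, the simulated mean should agree closely with the analytical value. Subtracting the reference population's mean from every individual score is one way to place two populations on a common scale.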
For the common allele low-effect-size genetic architecture model, which, based on [17], was expected to be the most suitable for explaining the heritability of the analyzed LODs, the risk alleles were discretized into five equally spaced values within the defined range, with an equal proportion of each allele and an equal odds ratio in each. In this case, the MAFs were distributed in equal proportions of 0.073, 0.180, 0.286, 0.393, and 0.500, while the relative risk (RR) values were 1.15, 1.125, 1.100, 1.075, and 1.05. Thus, 25 combinations were possible. These entire blocks were repeated until the target heritability level was achieved, which, in this case, was 36 times for h 2 = 50%. Figure A1 demonstrates the populations' SNP and PRS distribution for the 50% heritability scenario and the 80% scenario used in the simulations of Dupuytren's disease. The genetic architecture scenarios were defined in comma-separated values (CSV) files in the executable folder. In this study, the files were given names such as 'A0.txt' and 'A11S.txt'. The file 'ConditionFiles.txt' specifies which genetic architecture files were loaded for a given simulation run. It contains two columns: the first column defines the number of times to repeat the loading of a genetic architecture file to achieve the desired heritability, and the second column specifies the name of the genetic architecture file. The simulations always operated on two populations. Therefore, two files were always specified in two lines; the '#' symbol was used as the first character in a line to comment out that particular line. Genetic architectures can be specified by the same architecture file when the initial population is homogeneous, or each file can represent different allele frequencies, but the effect sizes must match.
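As a plausibility check on the block-repetition scheme, the following Python sketch computes the heritability of the five-by-five MAF and RR grid quoted above when it is repeated 36 times. It is an illustration, not the study's code: the function names are hypothetical, and the sketch assumes the allele variance takes the form 2 Σ p_k (1 − p_k)(log r_k)², with heritability var / (var + π²/3) on the logistic liability scale, following [17].

```python
import math

# Variance of the standard logistic distribution.
LOGISTIC_VARIANCE = math.pi ** 2 / 3

def allele_variance(mafs, rrs, repeats=1):
    """Variance contributed by every MAF x RR combination, repeated `repeats` times.

    Assumes var = 2 * sum_k p_k * (1 - p_k) * log(r_k)^2.
    """
    block = sum(2.0 * p * (1.0 - p) * math.log(r) ** 2 for p in mafs for r in rrs)
    return repeats * block

def heritability(variance):
    """Heritability on the logistic liability scale: var / (var + pi^2 / 3)."""
    return variance / (variance + LOGISTIC_VARIANCE)

# Values quoted in the text for the common low-effect-size architecture.
MAFS = [0.073, 0.180, 0.286, 0.393, 0.500]
RRS = [1.15, 1.125, 1.100, 1.075, 1.05]
REPEATS = 36

n_snps = REPEATS * len(MAFS) * len(RRS)
h2 = heritability(allele_variance(MAFS, RRS, repeats=REPEATS))
print(f"{n_snps} SNPs, heritability = {h2:.3f}")
```

Under these assumptions, 36 repeats of the 25-combination block (900 SNPs) give a heritability very close to the 50% target stated in the text.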
Only the following columns from the genetic architecture files were applicable to these simulations (the remaining columns may be set to 0 and ignored): SNP denotes the RSxxx-style SNP identifier; EAF is the effect allele frequency; and OR is the odds ratio, which is actually the allele relative risk in this case. Additional SNP lines were used to facilitate specific analyses. This was accomplished by duplicating the entries and setting alternate entries to either the EAF required in the genetic architecture or a low frequency to simulate a different allele or an allele that was not represented in the populations.

Figure A4. The average number of edited SNPs per individual in the scenario in which gene therapy maintains a constant optimal level of disease risk in the population participating in gene therapy, with differing degrees of admixture with a non-participating population. The depth of the first edit was identical for all admixture scenarios. The highest admixture rate among populations led to an initially higher number of maintenance edits, and the asymptotic balance (the point at which maintenance edits were no longer needed) was reached more quickly. By comparison, lower levels of admixture required fewer initial maintenance edits per generation to maintain a constant risk level for the population participating in therapy; however, the number of generations required to reach equilibrium was much larger.

Figure A5. Relative PRS and prevalence progression during population admixture. The panels show the normalized relative change in the population PRS and disease prevalence, where "1" is the initial value of the variable, and "0" is the equilibrium value. (A) shows the ∆PRS relative to the equilibrium value of the normalized PRS. The displayed fractions are identical to the simple population admixture proportions that would occur at the displayed rates of admixture. (B) shows the ∆Prevalence relative to the equilibrium value of the normalized disease prevalence for the population not directly participating in the preventive therapy, when the average PRS of the participating population is maintained at a tenfold improved level (RR = 10.0, PRS = −2.30). (C) shows the same quantity when the average PRS of the participating population is maintained at a fourfold improved level (RR = 4.0, PRS = −1.386). The normalized relative improvement for the population not directly participating in therapy was slightly slower than in (B). At a 10% admixture rate, the normalized prevalence in the first three admixed generations was 93%, 86%, and 80% in the RR = 4.0 scenario, compared with 92.5%, 84%, and 76% in the tenfold improvement scenario (RR = 10.0) in (B). Higher admixture rates show even faster prevalence reduction. The relative prevalence reduction is therefore comparable in the two scenarios, even though the absolute asymptotic reduction differs 2.5-fold between them. (D) shows, for comparison with Figure 5D, the absolute prevalence improvement resulting from admixture for populations not participating in preventive gene therapy, when the average PRS of the participating population is maintained at a fourfold improved level (RR = 4.0, PRS = −1.386); the corresponding normalized figure is depicted in (C).