A Weighted Genomic Relationship Matrix Based on Fixation Index (FST) Prioritized SNPs for Genomic Selection

A dramatic increase in the density of marker panels has been expected to increase the accuracy of genomic selection (GS), unfortunately, little to no improvement has been observed. By including all variants in the association model, the dimensionality of the problem should be dramatically increased, and it could undoubtedly reduce the statistical power. Using all Single nucleotide polymorphisms (SNPs) to compute the genomic relationship matrix (G) does not necessarily increase accuracy as the additive relationships can be accurately estimated using a much smaller number of markers. Due to these limitations, variant prioritization has become a necessity to improve accuracy. The fixation index (FST) as a measure of population differentiation has been used to identify genome segments and variants under selection pressure. Using prioritized variants has increased the accuracy of GS. Additionally, FST can be used to weight the relative contribution of prioritized SNPs in computing G. In this study, relative weights based on FST scores were developed and incorporated into the calculation of G and their impact on the estimation of variance components and accuracy was assessed. The results showed that prioritizing SNPs based on their FST scores resulted in an increase in the genetic similarity between training and validation animals and improved the accuracy of GS by more than 5%.


Introduction
Recent advances in high-throughput genotyping and sequencing techniques have led to the generation of dense marker panels and facilitated the genotyping of large numbers of individuals. Because of the availability of these cost-effective genotyping technologies and the increase in sequencing speed, large-scale genotyping for single-nucleotide polymorphisms (SNP) has become more affordable and accessible. Genomic data provide an unprecedented opportunity to dissect the genetic basis of complex traits and identify relevant functional associations.
From an animal breeding perspective, the use of genomic information results in a substantial reduction in generation interval and an increase in the accuracy of predicted breeding values, leading undoubtedly to an improvement in the genetic response [1][2][3][4][5][6]. Genomic selection (GS) is often carried out using multiple regression or mixed linear models [7][8][9][10][11][12]. For both methods, the density of the SNP marker panel and the linkage disequilibrium (LD) structure between markers and quantitative trait loci (QTL) have a great impact on accuracy. Regression-based methods directly model the association between the phenotypes and all or a subset of the genotyped variants. Thus, their problems stem mainly from the high dimensionality of the parameter space. As the effect of a QTL (often small for complex traits) is distributed in a nontrivial manner between all markers that are in LD with the causal mutation, there is little statistical power to accurately estimate its effect. Traditionally, SNP filtering is conducted based on certain statistical criteria such as p-values for single-marker analyses or quality-of-fit and model determination for Bayesian procedures such as BayesB [13] and BayesR [14]. The latter has shown some superiority for certain traits in the presence of low-and moderate-density marker panels as compared with models that include all markers, however, they still suffer, although to a lesser degree, from high false positives, multiple testing problems, high LD, and small SNP effects, which have hampered at different degrees the efficiency of these methods [15][16][17]. Although these factors are likely to affect the prioritization of relevant variants, they have limited to no effects on prediction. An increase in SNP marker density, after a certain threshold, seems to not affect the quality of the estimated observed relationship matrix (G) and thus the performance of mixed linear model-based approaches. There was no difference in accuracy between the 777K SNP and the 54K SNP panels [18]. This is because the quality of G when either 777K or 54K SNP panel were used was not that different. Due to these limitations, prioritization of variants to be included in the association model or to compute the genomic relationship matrix has become a necessity. Commercial livestock species are under heavy artificial selection. The effects of such selection on the genome can be traced through the changes in allele frequencies. The fixation index (F ST ) measures the rate of fixation through the increase in homozygosity and it has become an important tool to study population structure in humans, animals and plants. Chang et al. [19] proposed utilizing the F ST which measures the allele differentiation among subpopulations to identify segments of the genome under selection pressure. There was an increased genomic similarity and improvement in the accuracy of genomic selection when the F ST scores were used to prioritize SNP markers in high-density panels as compared with using BayesB [20] and BayesC [21] approaches. Furthermore, they showed that the genomic relationship matrix and the accuracy could be improved using prioritized SNPs based on the F ST scores.
Genomic best linear unbiased prediction (GBLUP) assumes equal weight for all SNPs [22,23]. Sun et al. [24] developed a two-step method for calculating weights in weighted GBLUP (WGBLUP). If weights are known, WGBLUP calculates genomic estimated breeding values (GEBVs) similar to the Bayesian method using the same weight. This method is effective for distinguishing major QTL, however, the accuracy of the GEBVs is reduced since it shrinks small SNP effects to zero. To achieve the highest accuracy, the weight formula needs to be modified to avoid having SNPs with no effect. The use of a genomic relationship matrix that weights marker's contribution can improve prediction accuracy, but the improvement is trait and population specific due to differences in genetic architecture. Weighted single-step GBLUP (WssGBLUP) have been developed for estimating weights within single-step GBLUP (ssGBLUP) process [25,26]. Wang et al. [26] and Fragomeni et al. [27] evaluated the performance of the WssGBLUP approach using simulation data. Two iterations of weights were calculated from the variance explained by each SNP. Their results showed that weighting SNP could be effective in improving the accuracy of GEBV prediction and in the estimation of marker effects.
With all these methods, the challenge was to determine how to derive the optimum set of weights to compute the genomic relationship matrix. In this study, the F ST scores-based prioritization method developed by our group (Toghiani et al. [28] and Chang et al. [19]) has been be expanded to derive the needed weights to compute G. The specific objectives of this study were to derive F ST scores-based relative weights for SNPs included in the computation of G and to assess the impact of different strategies on the estimation of variance components and the accuracy of genomic selection.

Data Simulation
The QMSim simulation software [29] was used to simulate the genomic and phenotypic data. A randomly mated historical population was generated to initialize LD and to establish mutation-drift equilibrium and was used as a base to create a population with average LD between adjacent markers of 0.3. Three hundred historical generations were simulated based on random mating of an initial 8,000 animals, followed by an additional 5, 10, and 20 generations with population ranging between 12,000 and 17,000 animals. The base population (G 0 ) was founded by 1000 males and 15,000 females randomly selected from the historical generation. A trait with heritability equal to 0.30 was simulated and all genetic variation was assumed to be due to the simulated QTL. The mating system was at random throughout up to generation G 0 . We also simulated an additional 15,000 animals for each of 7 additional generations (G 1 -G 7 ). The parents were sampled on their estimated breeding values (EBVs), with a replacement rate of 50 and 20% for males and females, respectively. We assumed one progeny per mating and a sex ratio of 50%. Each simulation scenario was replicated 5 times. The average of the effective population size was equal to 323. Data from generation 6 (G 6 ) was used as a training population and that of generation 7 (G 7 ) was used to evaluate (validation population) the proposed method. All animals in the training and validation populations were genotyped with 400,000 SNP markers simulated to be uniformly distributed along 10 chromosomes of 100 cM in length each to approximate about 1.2 million SNP markers in the bovine genome. We sampled 200 biallelic QTL from a Gamma distribution with shape and scale parameters equal to 0.4 and 0.15 respectively. We did not allow any overlap between the SNP markers and QTL.
Additionally, QTL were assumed not to be genotyped. The residual variance was scaled accordantly in each scenario of selected SNPs such that the heritability and phenotypic variance were constant at the values of 0.3 and 1, respectively. Trait phenotypes were generated as the sum of an overall mean, the random additive effects of QTL and their associated genotypes and the residual terms. The latter were sampled from a normal distribution with zero mean and variance-covariance matrices Iσ e 2 where σ e 2 is the residual variance.

SNPs Prioritization Based on F ST Scores
Briefly, divergence between populations and subpopulations is often due to differential selection pressure. Wright's fixation indexes (F ST ) have been used to measure the level of genetic differentiation between populations based on change in allele frequencies. The F ST scores were calculated following Nei [30] and Chang et al. [19]. Specifically, the trait phenotypes for animals in generation 6 (G 6 ) were divided into three sub-populations based on the 5% and 95% quantiles (below the 5% quantile (S 1 ), between 5% and 95% quantiles (S 0 ), and above the 95% quantile (S 2 )). Genotypes of individuals (1500) in sub-populations S 1 and S 2 were used to calculate the F ST scores. For each locus, the global F ST estimator was defined as: where p Si and q Si are the allele frequencies in subpopulation i, n S1 and n S2 are the number of individuals of the subpopulations S 1 and S 2 , H S is the average heterozygosity of subpopulations, and H T is the heterozygosity based on the total population.

Prioritized SNPs and Genomic Relationships
Several methods have been proposed to calculate the genomic relationships [31][32][33][34][35]. In animal breeding applications, the genomic relationship matrix (G) is often calculated using the method proposed by VanRaden [32]. It basically measures the similarity of marker genotypes between two individuals at a large number of loci independent of their mode of inheritance. Estimating observed additive relationships using identity by state provides a better estimate than using pedigree information, but still suffers from several problems including nonzero estimates of realized relationship between two individuals that are not related by ancestry as it was shown by [36][37][38], negative off-diagonal elements, and the inevitable noise associated with these estimates. Furthermore, several studies [4,18,39] have shown that little to no improvement in G were observed with an increase in the number of SNPs used for its calculation. Current methods used to calculate G, generally, give the same weight to all the markers, and thus could not guarantee the optimality of genetic similarity between individuals at the QTL. For that purpose, contributions of the SNPs used to compute G have to be weighted according to their importance on the phenotype (strength of association with the phenotype). To maximize the functional genomic similarity between individuals, the SNPs have to be prioritized based on their ability to increase genetic or phenotypic similarity between individuals. Conversely, individuals with different genetic values or phenotypes are likely to have much lower genomic similarity at QTL than the expected or observed additive relationships. It is worth mentioning that F ST is only a measure of population differentiation and an increase in functional similarity is achieved through an increase of the relative weight of prioritized markers.
The challenge in maximizing the genomic similarities is finding the relative weights for the SNPs used in the calculation of G. In this study, F ST scores were used to prioritize and to assign relative weights to the SNP markers. The top 20K SNPs based on their F ST scores were used either alone or with the remaining 380K SNPs to compute G with or without weighting. When only the top 20K SNPs were used to compute G, the following two scenarios were considered: 1) equal weights for all SNPs or 2) weights proportional to each SNP F ST score. When all 400K SNP markers were used, the different weighting scenarios evaluated are presented in Table 1.
The relative weights were calculated using the following equation: where w i is the relative weight for SNP i, Fst i is the F ST score for SNP i and N is the total number of SNPs (400K or 20K).

Data Analysis
For all scenarios, 10,000 and 5000 animals were randomly selected from G 6 and G 7 , respectively. For each scenario, the genomic relationship matrix was computed with the appropriate number of Genes 2019, 10, 922 5 of 10 markers and the weighting factors and the analysis was carried out using the following mixed linear model: where y is a N × 1 column vector of phenotypes, X is a N × p known incidence matrix of the p predictor variables, b is a p × 1 column vector of fixed effects regression coefficients, Z is a N × q known incidence matrices with the appropriate dimensions for the q random effects, u is a q × 1 column vector of genomic breeding values, and e is a N × 1 column vector of random residuals. Additionally, it was assumed that u ∼ N 0, Gσ 2 u , with σ 2 u being the genetic variance. The AIREMLF90 program [40] was used to estimate variance components and to predict the genomic breeding values for the different scenarios. Accuracy of genomic evaluation was defined as the correlation between true and estimated breeding values in the validation population. Each simulation scenario was replicated 5 times. Table 1 presents the estimates of the variance components and heritability and their associated standard deviations for the different scenarios when all 400K SNP markers were used to compute the genomic relationship matrix. In general, the percentage of genetic variance recovered increased with a decrease of the percentage weight assigned to the prioritized top 20K SNPs reaching a maximum when the top SNPs (based on F ST scores) accounted for 25% or less of the weights used to compute G. In all cases, the genetic variance was underestimated when no weights were used (scenario 7 in Table 1). Similarly, only two-thirds of the genetic variance were recovered when zero weights were assigned to the 380K nonprioritized SNPs (scenario 1 in Table 1). Table 2 presents the distribution of off-diagonal elements of G for different weighting scenarios. In fact, the portion of genomic relationships between training and validation individuals exceeding 0.03 was 5.24% when all 400K SNPs were used with equal weight. The same portion was 5.59%, 7.22%, 11.13%, 14.38%, and 16.78% when the relative weight assigned to the top 20K prioritized SNPs in the calculation of G was 25%, 50%, 75%, 90% and 100%, respectively. When only the top 20K prioritized SNPs were used to compute G, weighting the contribution of each marker by its F ST score resulted in an increase in the off-diagonal elements exceeding 0.03 (Table 3). The increase in the percentage of off-diagonal elements exceeding 0.03 is an indicator of increased similarity between the training and validation datasets and could lead to increase in accuracy. Table 2. Distribution of off-diagonal elements (OD) of the genomic relationships matrix corresponding to the training and validation individuals using all 400 SNPs and for different weighting scenarios for the prioritized 1 (20K) and nonprioritized (380K) SNPs (in %).

OD Weights (w i w j ) No Weight (w i =1)
OD When the same weight (w i = 1) was used for all 400K SNP markers to compute G, the accuracy of genomic prediction (correlation between true and predicted BVs) was 0.690 ( Figure 1) and it increased to 0.718 when all SNPs in the panel were weighted by their relative F ST score. When the relative weight of the top 20K prioritized SNPs in the calculation of G increased, higher accuracy was achieved. In fact, accuracy increased by 4.3%, 5.2%, 5.4%, 5.3%, and 5.2% as compared with the scenario where all markers had the same weight (w i = 1) when the relative weight assigned to the top 20K prioritized SNPs in the calculation of G was 25%, 50%, 75%, 90% and 100%, respectively (Figure 1).
A comparison between the different weighting scenarios for the contribution of the 20K prioritized SNPs and the remaining 380K markers showed a superiority for scenarios one to six as compared with scenario seven (equal weights) in terms of quality-of-fit of the model (Table 4). In fact, the −2Log likelihood ranged from 26,397.30 to 26,746.90 for scenarios one to six with the best fit being for the third and fourth scenarios. When the same weights were assigned to all 400K SNP (scenario seven), the −2Log likelihood was 27,378.30. Similar behavior was observed for the estimated residual variance ( Table 4). Regression of the estimated breeding values on the true ones showed a systematic under estimation for all seven scenarios, although the bias was slightly smaller for scenarios one to six (Table 4).   Figure 1. Accuracy of genomic prediction for different weighting scenarios for the contribution of the 20K prioritized SNPs and the remaining 380K markers (x,y). Horizontal lines indicate the accuracy using only the top 20K SNPs with (red) or without (green) weights SNPs.

Discussion
We showed that only a portion of the genetic variance was recovered for the different scenarios. The inability to recover all the genetic variance is due to the large number of QTL with very small effects. In fact, 55% of QTL have a true effect smaller than one-tenth of one percent of the genetic variance and an additional 20% of QTL have an effect smaller than 0.5% of the total genetic variance. These small effect QTL are hard to track effectively when the LD is moderate to low. Across the different scenarios, there is an underestimation trend of the residual variance, although it does not seem to be any systematic bias. Heritability was clearly underestimated when the majority of the weight (≥90%) was allocated to the prioritized top 20K SNPs (scenarios one and two in Table 2). In fact, for those scenarios, estimates of the heritability are likely to be biased. For the remaining scenarios, although there is a general trend of an underestimation of the heritability, estimates are not likely to be biased. When only the unweighted top 20K prioritized SNPs were used to compute G, the genetic and residual variances were very similar to the estimates obtained for scenario one in Table 2.
Intrinsically, the contribution of a SNP marker to the estimation of G is weighted by its minor allele frequency (MAF), thus favoring markers with low MAF. However, it is not weighed by the size of the marker effect. Consequently, after a certain number of SNP markers are included in the computation of G, little to no improvement is expected. Chang et al. [19] showed that the limited change in G with additional markers could be an indicator of the sufficiency of available SNPs in estimating the realized relationships. However, such sufficiency is not a guarantee of the optimality of such matrix for the implementation of association and genome selection analyses. In fact, as the number of randomly selected SNPs increased from 40K to 400K, the matrix G inched closer to the expected additive relationship matrix (A). Furthermore, they showed that a genomic relationship matrix computed based on a selected subset on 20K markers was markedly different from A. In this study, we further prove that within those selected 20K SNPs additional improvements could be achieved through appropriate weighting of the contribution of these SNPs in the calculation of G.
Weighting all markers with their relative F ST scores resulted in a 4.3% increase in accuracy as compared with the same weight scenario (w i = 1). Using only the prioritized 20K SNPs with or without Genes 2019, 10, 922 8 of 10 weights resulted in a 5.2% and 3.5% increase in accuracy as compared with the same weight scenario. As the density of the marker panel increases, using all SNPs to compute G is not the best option.
These results clearly indicate that additive relationships between individuals could be accurately estimated with a reasonably small number of well distributed SNP markers, however, that does not mean that the accuracy of genomic selection cannot be improved using high-density marker panels or even sequence data. To achieve that goal, the genomic matrix has to evolve from a measure of additive relationships to an optimum measure of genetic similarity at QTL between individuals. The F ST scores seem to be an efficient prioritization tool to achieve such a goal, however, it should be noted that F ST scores are only measures of fixation index. A combination of metrics of fixation index and index of genetic differentiation could lead to better representation of population partitioning [41] and could enhance the prioritization and weighting of SNP markers.

Conclusions
The dramatic increase in the number of identified common and rare variants due to advances in NGS was expected to significantly increase the accuracy of GWAS and GS. Unfortunately, little to no improvement in accuracy was observed using NGS or high-density marker data. In spite of the repeated argument that all needed information is already captured by the available marker panels, the results of this study clearly show that the lack of improvement in accuracy is due to the limitations of the methods used rather than the limited additional information in the high-density and sequence data. Prioritizing SNP markers based on their F ST scores and using the latter to compute relative weights has increased the genetic similarity between training and validation animals. Furthermore, it resulted in more than 5% improvement in accuracy. These results clearly indicate that additive relationships between individuals could be accurately estimated with a reasonably small number of well distributed SNP markers, however, that does not mean that accuracy of genomic selection cannot be improved using high-density marker panels. The genomic matrix should evolve from a measure of realized additive relationships to an optimum measure of genetic similarity between individuals. The current method used to calculate the genomic relationship matrix gives the same weight to all the markers and thus does not guarantee the optimality of genetic similarity at QTL.