Single-Step Genomic Evaluations from Theory to Practice: Using SNP Chips and Sequence Data in BLUPF90

Single-step genomic evaluation became a standard procedure in livestock breeding, and the main reason is the ability to combine all pedigree, phenotypes, and genotypes available into one single evaluation, without the need of post-analysis processing. Therefore, the incorporation of data on genotyped and non-genotyped animals in this method is straightforward. Since 2009, two main implementations of single-step were proposed. One is called single-step genomic best linear unbiased prediction (ssGBLUP) and uses single nucleotide polymorphism (SNP) to construct the genomic relationship matrix; the other is the single-step Bayesian regression (ssBR), which is a marker effect model. Under the same assumptions, both models are equivalent. In this review, we focus solely on ssGBLUP. The implementation of ssGBLUP into the BLUPF90 software suite was done in 2009, and since then, several changes were made to make ssGBLUP flexible to any model, number of traits, number of phenotypes, and number of genotyped animals. Single-step GBLUP from the BLUPF90 software suite has been used for genomic evaluations worldwide. In this review, we will show theoretical developments and numerical examples of ssGBLUP using SNP data from regular chips to sequence data.


Introduction
In the early 1980s, Soller et al. [1] hypothesized that DNA markers like RFLPs (restriction fragment length polymorphisms) would be beneficial in constructing more precise genetic relationships, followed by parentage determination, and the identification of quantitative trait loci (QTL). The high cost of genotyping animals for such markers probably prevented the early widespread use of this technology. When the first draft of the Human Genome Project became available in 2001 [2], one of the most exciting news that came along was that the majority of the genome sequence variation can be attributed to single nucleotide polymorphisms (SNPs). The reality is that SNP markers have become the bread-and-butter of DNA sequence variation [3] and they are now an important tool to determine the genetic potential of livestock. This is because SNPs are abundant, as they are found throughout the entire genome [4], as in introns, exons, promoters, enhancers, or intergenic regions. In fact, there are about three billion nucleotides in the bovine genome, and there are over 30 million SNPs or one every 100 nucleotides parametrizations exist, but if AA = 0, AB = 1, and BB = 2, M has to be centered by allele frequency. Assuming a vector p with elements equal to p i , the frequency of allele B at locus i: To understand why Z is a centered matrix of allele content, we can use only one biallelic marker. If the effect of each copy of the A allele is a and the frequency of AA is p 2 , individuals with AA have a (non-centered) breeding value u = 2a; individuals aa have u = 0 with a frequency of q 2 ; individuals Aa have u = a with a frequency 2pq. The variance explained by this marker is Var(u) = E u 2 − E(u) 2 [34]. The average of u is 2ap 2 + a2pq; which becomes 2pa. The variance explained by one marker is: (2a) 2 p 2 + 2pq(a) 2 − (2pa) 2 = 2pqa 2 . Given the average of u is 2pa, as shown above, we can compute the covariance between individuals i and j for this marker. If we express the breeding values of the animals i and j as ma deviated from the population mean [34], we obtain Equations (2) and (3): According to Legarra et al. [34], if Var(a) = Iσ 2 a , or marker variance, and the genetic variance in Hardy-Weinberg equilibrium is 2 p i q i σ 2 a , the rules of variances and covariances can be applied: If instead of using the allele coding 0,1,2 we use −1,0,1: Dividing the covariance by the genetic variance 2 p i q i σ 2 a , we get realized relationships. Going from one to several markers, the breeding value of an animal can be calculated as the sum of SNP effects weighted by the genotype content (u = Za). Assuming the same variance per locus, the variance of u is: Var(u) = Var(Za) (6) Var(u) = Z Var(a) Z Var(u) = ZZ σ 2 a If the genetic variance σ 2 . Replacing σ 2 a in (8) we have that: Therefore, and according to VanRaden [9], the genomic relationship (G) is given by: then, Var(u) = Gσ 2 u Therefore, genomic relationships are standardized covariances. When ZZ is divided by 2 p i (1 − p i ), G becomes analogous to the pedigree relationship matrix (A). The G matrix contains the number of homozygous loci for each individual in the diagonals, and the number of alleles shared among individuals in the off-diagonals. Other ways to construct the genomic relationship matrix are described in the literature. For more details, check Leutenegger et al. [35] and Amin et al. [36].
If G is centered using observed allele frequencies, the average over all elements is zero and average diagonal is 1 when there is no inbreeding. However, it is only when base allele frequencies are used that elements of G can be interpreted as elements of A (this will be more detailed later). In general, G traces inbreeding much further than A because of its IBS nature and because A is limited by the recent pedigree recording.
When the number of genotyped animals is bigger than the number of SNPs, or if there are similar individuals (e.g., clones), G becomes singular; therefore, cannot be inverted. To overcome this problem, usually, G is "blended" with a small percentage of an identity matrix or the pedigree relationship matrix among genotyped animals (A 22 ): (13) where the blending parameter α is usually 95% but can vary from 99 to 80% [37], or even to 50% [38]. The blending parameter (1 − α) can be understood as the fraction of genetic variance not explained by markers and computed by maximum likelihood methods (see below).

From GBLUP to ssGBLUP
Understanding the difference between GBLUP and ssGBLUP is a crucial step. Because there is still a lot of confusion, an explanation about GBLUP is provided.
The GBLUP is equivalent to SNP-BLUP, but in GBLUP genomic breeding values (u = Za) are estimated, instead of SNP effects (a) in SNP-BLUP. It also assumes that SNPs have a priori a normal distribution; the majority of SNPs have a small effect, and very few have moderate to large effect. Using a simple animal model as shown in (14) and (15): where W is the incidence matrix for animal effect (u), X is the incidence matrix for fixed effects (b), σ 2 e is the residual variance, and u ∼ N 0, Gσ 2 u . Therefore, GBLUP is a BLUP where A is replaced by the genomic relationship matrix. The effectiveness of GBLUP will depend on the ability of G to approach the realized genetic relationships. In addition, performing a quality control of genomic data before constructing G avoids biases and losses of accuracy.
If we assume that not all the genetic variance is explained by markers, an extra polygenic effect can be included to explain the remaining variance. In this case, the model in (14) becomes: where g is a vector of residual polygenic effect that is not captured by the SNPs. Assuming that α is the proportion of variance explained by SNPs, the total additive genetic effect (u g ) becomes Var(u g ) = αGσ 2 Therefore, In real situations, it is assumed that α varies from 0.8 to 0.95. Note that this is also going to make G invertible [17]. When (1 − α) is used strictly to make G (semi-) positive definite, it is called a blending parameter.
Although GBLUP has been widely used in animal and plant breeding applications, its main problem is that only genotyped animals are in the model. As only a fraction of animals is genotyped, GBLUP may have less phenotypic and pedigree information than BLUP. Because of that, some extra steps are needed to combine genomic and pedigree information. When using GBLUP, SNP-BLUP or Bayesian models, the genomic evaluation method is called multistep. The steps involved in multistep are: (1) Estimation of EBV using traditional BLUP (i.e., all available information); (2) de-regression of EBV, which condenses information from phenotypes (e.g., daughter yield deviation in dairy cattle); (3) estimation of SNP effects using GBLUP or other models; (4) prediction of Za, which is also known as direct genomic values (DGVs); (5) blending DGVs with average of parent's EBV, which is known as parent average (PA), with published EBV, or with PTA. The main issue on having an evaluation with several steps is that some errors and biases can be introduced during those steps [10], and that BLUP will not be robust to genomic selection decisions [13].
The idea for ssGBLUP came from the fact that usually only a small portion of the animals, in a given population, is genotyped. In this way, the best approach to avoid several steps would be to combine pedigree and genomic relationships and use this matrix as the covariance structure in the MME. Legarra et al. [15] stated that genomic evaluations would be simpler if genomic relationships were available for all animals in the model. Then, their idea was to look at A as a priori relationship and to G as observed relationships; however, G is observed only for some individuals, and those individuals have A 22 as a priori relationship. Based on that, it was shown that the genomic information could be extended to non-genotyped animal based on the joint distribution of breeding values of non-genotyped (u 1 ) and genotyped (u 2 ) animals [15,17]: If we consider that Var(u) = Aσ 2 u (22) where subscripts 1 and 2 represent non-genotyped and genotyped animals, respectively. The conditional distribution of breeding values for non-genotyped and genotyped animals is If u 2 in A 12 A −1 22 u 2 is replaced by a vector of observed gene content, the formula can be used to estimated gene content for non-genotyped animals based on observed gene content for genotyped animals [39]. It implies that by using A 12 A −1 22 u 2 the genomic information can be implicitly imputed from genotyped animals to non-genotyped based on pedigree relationships. The variance in the distribution (A 11 − A 12 A −1 22 A 21 ) is the prediction error term. Therefore, because the animals with subscript 1 have no genotypes, the variance depends on their pedigree relationships with genotyped animals. In this way, variances and covariances are: Rearranging: Cov(u 1 , Finally, the matrix that contains the joint relationships of genotyped and non-genotyped animals is given by: which can be simplified to: This H matrix is; therefore, a relationship matrix constructed with SNP markers and pedigree, where the SNP information is projected to the individuals that are not genotyped. Some of its properties include being always semi-positive definite and being positive definite and invertible if G is invertible. Although H is very complicated, its inverse (H −1 ) is quite simple [16,17]: As both A −1 and G −1 capture relationships, A −1 22 should be subtracted to avoid double-counting of pedigree information for genotyped animals.
Assuming the following animal model: The MME for ssGBLUP becomes: The distribution of u becomes: Therefore, the only difference between BLUP and ssGBLUP is that A −1 is replaced by H −1 . Subsequently, all tools based on BLUP mixed model equations, as the restricted maximum likelihood (REML [40]), can be easily converted to single-step. In a nutshell, if all animals are genotyped, ssGBLUP becomes GBLUP, but if no animals are genotyped, ssGBLUP becomes BLUP.
Advantages of ssGBLUP include simplicity of use, simultaneous fit of genomic information and estimation of fixed effects [10], relatively higher accuracy than multistep methods [41][42][43][44][45], potential to account for pre-selection bias as all pedigree, phenotypic, and genomic information can be included in the model [12,13], and can be easily extended to any model.

Applying ssGBLUP to a Simulated Data Using blupf90
A dataset that mimicked a cattle population was simulated using QMSim [46]. Pedigree information and phenotypes for 10,000 animals, and genotypes for 1020 parents from generations 1-4 and 1004 individuals in generation 5 were generated. Files with pedigree, phenotypes, and genotypes are available at https://github.com/danielall/Data_ssGBLUP. Shortly, the pedigree file is named pedigree.txt and contains three columns: animal, sire, and dam. The phenotype file is named phenotypes.txt and contains animal, sex, phenotype, true breeding value, and generation. Phenotypes (y) were generated as y = sex_effect + true_breeding_value + residual. Genotypes were coded based on the number of copies of the alternative allele (0, 1, 2) and are in a file named genotypes.txt, with: animal and SNP_genotype. The last file (gen_map.txt) contains the map for SNPs: SNP identification, chromosome number, position (in base pairs).
After running renumf90 to renumber the data (see Appendix B), the renumbered phenotype file is named renf90.dat and contains phenotype, renumbered sex code, and renumbered animal ID; the renumbered pedigree file is renadd02.ped; and the parameter file generated by renumf90 is named renf90.par (Box 1). This parameter file was created based on the following model: y = sex + u + residual, where u is the animal effect or direct additive genetic effect. To run ssGBLUP, blupf90 can be used with the parameter file given in Box 1 (see Appendix B for a description of keywords and values). The following command line can be used to save the screen output to a file: blupf90 renf90.par | tee blupout.log The above command will provide the parameter file when blupf90 asks for it and will save the screen output to a file named blupout.log.  The output file provided by blupf90 with solutions for all effects is the "solutions" file, and the first 5 lines of this file are shown in Box 2. The first line is a header indicating columns for trait, effect, level, and solution. In this example, only one trait was used, so all entries in the trait column are 1; the effect column contains the number of the effects in the model (i.e., sex and animal effect); level refers to the levels of the effects (i.e., 2 for sex and 12,010 for animal effect (direct additive genetic)); the last column contains the solutions for all levels of the effects in the model. As ssGBLUP was used by blupf90 because the option OPTION SNP_file was included, solutions of the animal effect are GEBV for both genotyped and non-genotyped animals. It is important to remember the effects were renumbered using renumf90, so the original and renumbered levels for fixed effects and animal effect are in renf90.tables and renadd02.ped, respectively (see Appendix B). To have GEBV matched back to the original ID, a simple R script, as the one in Box 3, can be used. rm(list=ls()) sol<-read.table("solutions", skip=1) sol_gebv<-subset(sol,sol[,2]==2) names(sol_gebv)<-list("trait","effect","level","solutions") ped<-read. The blupf90 software outputs a large amount of information on the screen, including quality control checks, statistics for G and A 22 and respective inverses, and statistics for G −1 − A −1 22 . This is because when the option OPTION SNP_file is used in blupf90, it turns the genomic library on and all checks are done. To avoid doing quality control of genomic data when using blupf90, add the following option at the end of the parameter file: OPTION no_quality_control. The genomic library has an interface software called preGSf90, which contains a myriad of options. To check all options available in the genomic library: http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90. To see how to use preGSf90 to perform quality control and preprocessing of genomic data, check Appendix C.

Compatibility between Pedigree and Genomic Relationships
Based on how H is constructed, the central element is G − A 22 (see Equation (31)), which implies both matrices should be compatible [10,48]. Compatibility can be understood as both matrices referring to the same genetic base and to the same genetic variance. However, genomic relationships can be biased if G is constructed based on allele frequencies other than the ones from the base population [9]. However, allele frequencies from the base population are not known because of the recent recording of pedigrees (i.e., the base population per se is unknown). Although those frequencies can be estimated using the method proposed by Gengler et al. [39], these estimates are not very accurate because the base population is several generations away from the genotyped individuals. Additionally, in certain contexts such as missing pedigrees there is not a uniquely defined base population. Most commonly, allele frequencies based on the recent genotyped population are used to construct G. When this is the case, the expectation of breeding values for genotyped animals is 0 [9]. However, if the population is under selection, mean breeding values should change from the base population to the genotyped individuals (i.e., they should deviate from 0). To account for selection and for the fact genotyped animals are more related through A 22 than G is able to reflect (i.e., especially when current allele frequencies are used), Vitezica et al. [48] proposed an adjustment factor (ρ) to match averages of G to averages of A 22 . This adjustment was crucial to avoid bias in ssGBLUP evaluations, especially in populations under selection. It can be calculated as: where n is the number of elements in A 22 and G. The new G is constructed as G* is the adjusted genomic relationship matrix, 1 is a vector of ones, and ρ is Wright's F ST , which models the difference between pedigree and genomic base by implicitly fitting a constant µ, unlike in Hsu et al. [49] where the constant is fit explicitly.
When ssGBLUP was first implemented [16] in the BLUPF90 family of programs, A −1 was constructed based on Henderson [50] and Quaas [51] assuming no inbreeding, a frequently-used approximation [52,53], G −1 was constructed based on VanRaden [9], and A −1 22 was based on Colleau [54] and fully considered inbreeding. As the algorithms to construct G −1 and A −1 22 implicitly consider inbreeding, but not the algorithm for A −1 , H −1 was often ill-conditioned because of the unbalance between A 22 (i.e., the portion of A −1 for genotyped animals) and A −1 22 , which has larger coefficients due to inbreeding. This would lead to convergence problems and overestimation of GEBV. To solve this problem, scaling factors to decrease the amount of information in A −1 22 (ω) and to increase in G −1 (τ) were proposed [16,55]: Primarily, ω controls inflation due to incompleteness of pedigree and τ controls additive genetic variance [56]. The ω parameter was usually set to 0.7 for beef and dairy cattle ssGBLUP evaluations, and from 0.5 to 0.8 for pig evaluations. The appropriate value depended on the reduction of overestimation, which was evaluated based on validation studies. However, in 2016 the BLUPF90 developers observed that when inbreeding was considered in A −1 by adding an extra option to the renumf90 parameter file (see below), the need for ω lower than 1 was reduced. It was rather surprising that ignoring inbreeding in the set-up of A −1 , which is harmless in BLUP applications, had such a great impact in ssGBLUP. In fact, when genotyped animals have complete pedigree, τ and ω are likely to be equal to 1. Therefore, the compatibility among A −1 , G −1 , and A −1 22 is the key to avoid the use of ad-hoc scaling parameters while keeping GEBV with an acceptable level of inflation/deflation. To ensure consideration of inbreeding in the set-up of A −1 the lines INBREEDING pedigree need to be included in the parameter file for renumf90, and then the genetic effect in the parameter file for blupf90 needs to be RANDOM_TYPE add_an_upginb

Changing Blending, Tuning, and Scaling Parameters in blupf90
By default, in the blupf90 the blending parameter α is set to 0.95, which makes 1−α (or β) equal to 0.05. This is used to overcome singularity problems (i.e., G being non-positive definite). Using lower values for α can speed up convergence, with small or no impact on accuracy. To change α and β in blupf90, assuming the new values would be 0.90 and 0.10, the following option can be added to renf90.par: To model the difference between pedigree and genomic base, which is very important to reduce bias in GEBV, the default in the genomic library is to adjust G as proposed in Chen et al. [57]: To change the adjustment of G to the one proposed by Vitezica et al. [48] and demonstrated in Equation (38), the following option can be added to the blupf90 parameter file (e.g., renf90.par):

OPTION tunedG 4
A total of four different adjustments are implemented in the BLUPF90 family of programs; however, types 2 [57] and 4 [48] are more frequently used. To see other options, check this link: http://nce.ads.uga.edu/wiki/doku.php?id=readme.pregsf90.
If GEBV are underestimated/overestimated, ad-hoc scaling factors can be used to control the amount of information in A −1 22 (ω) and in G −1 (τ). The default in blupf90 is ω = τ = 1. To change those values, an option can be added to the blupf90 parameter file. Supposing only ω is to be changed to 0.95, whereas τ is still 1: Values of ω smaller than 1 helps to avoid overestimation; however, caution is recommended when using this option. A careful investigation of coefficients of the regression of a benchmark variable on GEBV in cross-validation studies is recommended when the objective is to choose an appropriate value.

Estimating SNP Effects in ssGBLUP
Even though ssGBLUP is a genomic relationship-based method and provides GEBV as final output, SNP effects can still be calculated in this method. This is because GBLUP is equivalent to SNP-BLUP [9] as u = Za and Var(u) = Var(Za). Using this idea, the selection index equation for GBLUP can be represented by:û where R is a diagonal matrix with (heterogeneous if needed) residual variance. Ifû â = Zâ , replacing the first G by Z , weighted by the ratio of SNP to additive direct variances (i.e., k = σ 2 a /σ 2 u ), would allow the calculation of SNP effects (a) [9]: As we saw before, σ 2 a = σ 2 u /2 p i (1 − p i ). Therefore, k can be reduced to 1/2 p i (1 − p i ). Assuming that: therefore,û = Gw (44) In this way, Finally, the SNP effects can be calculated as in (46): as Var(a) = D, a diagonal matrix of SNP variances, the conditional mean of SNP effects given the GEBV is: Thus, given GEBV from ssGBLUP are available, SNP effects are calculated as [58]: If SNP effects are available, indirect predictions (IP) can be calculated for young genotyped animals in between official ssGBLUP evaluations, as the sum of SNP effects weighted by gene content [18]. Except for the small portion of remaining pedigree variation, they are identical to GEBV [18]. Indirect predictions may also be useful for genotyped animals that have incomplete pedigree. Such animals can increase bias and reduce reliability of GEBV if included in official ssGBLUP evaluations, given that A is poorly constructed for them and results in incompatibilities with G (which is correct) [56]. Additionally, if lots of animals are genotyped, say weekly, but they do not contribute to the evaluation (as in young animals that do not have phenotype or progeny yet), having IP for them reduces computing costs.
Another feature of having SNP effects is the ability to account for the fact that SNPs explain different proportion of genetic variance on the trait, and this leads to iterative methods similar to Bayesian regressions (i.e., BayesA, BayesB). An iterative method was proposed to add different weights for SNPs under ssGBLUP, which is called weighted ssGBLUP [58]. In this method, seven steps are needed:

1.
Set the diagonal matrix of SNP variance or weight as an identity, D = I 2.
Estimate SNP variance for SNP i e.g., as d i = a 2 i (i.e., quadratic weight) 6.
Iterate from 2 until changes in SNP variance are small across iterations Usually, the best weights are obtained after one to two rounds. Different formulas can be used to calculate SNP variance, but all of them are approximations. Several authors have reported decrease in GEBV accuracy and increase in bias over iterations [59,60] when variance is calculated based on squared SNP effects, especially for more polygenic traits. This is because SNP variance would reach extreme values over iterations. VanRaden [9] proposed and successfully applied in US dairy cattle a formula to calculate SNP variance that limits the change over iterations, avoiding extreme values. This method is called non-linearA: where CT is a constant that determines the departure from normality; |â i | is the absolute estimated SNP effect for marker i, and σ â is the standard deviation of the vector of estimated SNP effects.
Garcia et al. [26] and Fragomeni et al. [61] showed that non-linearA had good convergence properties and avoided extreme values. The maximum change in variance is usually limited by the minimum between 5 and the exponent of CT; whereas CT was empirically derived as 1.125 over several polygenic traits for dairy cattle populations [9], meaning the distribution for SNP effects approaches a t distribution with large degrees of freedom, e.g., approaching a normal distribution. Considering SNP variances when constructing G in ssGBLUP seems to improve the accuracy of predicting GEBV for data sets with small number of genotyped animals, but marginal or no improvement was observed for large genotyped populations (i.e., >10 k genotyped animals) [60], even for less polygenic traits. If the data allows to accurately estimate SNP effects, there is no advantage in selecting SNPs and tagging chromosome segments differently. The fact that SNP selection does not improve accuracy with large datasets benefits commercial evaluations that use multiple-trait models, as models with different SNPs per trait are easy to implement for single-but not multiple-trait models [62].
Once the variance for each SNP is calculated, the proportion of additive genetic variance can be plotted for all SNPs in a Manhattan plot. A threshold of 1% of genetic variance can be assumed if the objective is to explore associations between traits and regions in the genome, like in genome-wide association studies (GWAS).
More formally, and according to the common use in ambitious GWAS studies, p-values for SNPs can be calculated as [63][64][65]: where Φ is the cumulative standard normal function and sd(â i ) is the square root of prediction error variance (PEV) of the i-th SNP effect. Prediction error variance for each SNP effect can be calculated as [65]: where C u 2 u 2 is the portion of the inverse of the LHS of MME for ssGBLUP (34) referent to genotyped animals; b is (1-ρ/2), which is a function of the tuning parameter in (38); k and α were defined previously.

Using postGSf90 to Compute SNP Effects, SNP Variances, and p-Values
If the objective is to backsolve GEBV to SNP effect and then calculate variance explained by SNPs, postGSf90 can be used. This software was primarily developed to serve this purpose, but recently was modified to also compute p-values for SNPs [63]. As this software relies on GEBV to calculate SNP effect and variance, blupf90 needs to be run first with three additional options in renf90.par:

OPTION saveGInverse OPTION saveA22Inverse OPTION snp_p_value
The first and second options save G −1 and A −1 22 , respectively, and the third option saves C u 2 u 2 , all in binary format. After running blupf90, renf90.par can be copied with another name, for example postgs.par, and the additional options are now: The first two options are to read G −1 and A −1 22 , respectively; the third option is used now to calculate p-values. A fourth option (i.e., OPTION windows_variance x) can be used if variance for SNP is to be calculated based on windows of x SNPs instead of individual SNP [58]. Based on Equation (48), postGSf90 needs GEBV, SNP content, and G −1 to compute SNP effect and variance. The first one is obtained from the blupf90 solutions file, the second from the SNP file, and the third from a file blupf90 created and named Gi. For the calculation of p-values, a file containing C u 2 u 2 was created by blupf90 and named xx_ija. After running postGSf90, several files are generated. One is snp_sol, which the column information is described in Box 4. Three extra files are chrsnp, chrsnpvar, and chrsnp_pval, which are used to generate Manhattan plots for SNP effect, proportion of variance explained by n adjacent SNPs, and -log10 (p-value), respectively. Additionally, R and gnuplot scripts are also generated to create the Manhattan plots described above. Box 5 shows how to generate Manhattan plots in R and gnuplot. By default, the maximum change in SNP variance is limited to 5, which is calculated as CT (5−2) and returns a value of 1.4238 with CT = 1.125. If this limit is to be changed to 10, the following option can be used, where the value provided (x) is the result of the expression CT (x−2) . As an example, if CT is 1.05 and x is 10, the value provided to the option should be 1.4775:

OPTION SNP_variance_limit 1.4775
A parameter file to run postGSf90 using non-linearA variance with CT equal to 1.05 and limit of 10, and computing p-value is in Box 6. Although the calculation of SNP effect and variance was designed to be an iterative method, it is not recommended to use the iterative process when using the option to calculate p-value [63]. To check how to have weighted ssGBLUP where SNP effect, SNP variance, and GEBV are updated in an iterative way, see Appendix D.

Accounting for Sequence Variants in ssGBLUP
Genomic selection relies on linkage disequilibrium (LD) between SNPs and quantitative trait nucleotides (QTNs). By having dense SNP panels (i.e., >50,000 SNP), it is more likely that a QTN will be in LD with at least one SNP. If QTN A is linked to SNP B, depending on the strength of this linkage, once SNP B is observed it will imply QTN A was inherited together. Therefore, it is expected that increasing the number of SNPs the accuracy of genomic selection will increase. VanRaden et al. [66] showed an average increase of 1.6% in reliability of GEBV for a simulated trait when using 500,000 instead of 50,000 SNPs. According to Meuwissen et al. [67], the ideal SNP density is given by whole-genome sequence data. As millions of SNPs are screened, the causative variants are expected to be among them. However, the use of high-density SNP chips did not increase accuracy from medium-size chips, and use of sequence data is showing only marginally higher accuracies.
Using simulated data, Fragomeni et al. [68] showed that accuracy of GEBV in weighted ssGBLUP can approach 1 (i.e., perfect genomic prediction) if all causative variants are known and the true variance is assigned to each one of them. In a US Holstein dataset, Fragomeni et al. [61] tested the performance of ssGBLUP when using nearly 54,000 SNPs and when adding 17,000 significant variants discovered in a GWAS (pre-selected sequence SNPs) that involved 33 traits [69]. Although VanRaden et al. [69] reported an increase in reliability of GEBV of 4.3 points for stature by using non-linearA weights in a multistep scenario, no gain was observed in Fragomeni et al. [61] using either quadratic or non-linearA weight in ssGBLUP. This is possibly because the amount of data used in ssGBLUP overwhelms any a priori assumption made about SNP effects, making this method less sensitive to SNP weighting in the presence of large data. Another hypothesis to explain the steady reliability is that not all causative variants were present among the 17,000 significant SNPs. Although causative variants can be included in ssGBLUP assuming different weights for SNPs, maximizing the accuracy of GEBV would require the true identification of all causative variants, their substitution effect, their position, and the proportion of additive genetic variance they explain. To identify some of the causative variants, a large number of sequenced animals with phenotypes is needed. When a large amount of information is available, the accuracy may be high enough; therefore, improvements from the incorporation of causative variants are likely to be small for large data sets. When the number of genotyped animals is larger than the number of independent chromosome segments, the accuracy is maximized without SNP weighting/selection [70][71][72].
An optimal algorithm for finding causative SNPs would be to assign a large variance to those causative SNPs while reducing the variance of nearby SNPs or setting it to 0. This would resemble methods that perform a sequential estimation of SNP effects/variances. In such methods, SNPs close to causative variants are automatically disregarded. Although there are three different weighting methods implemented in BLUPF90 programs, any external weight can be considered, including the ones from Bayesian regressions. The only requirement is that those external weights have to be rescaled to sum to the number of SNPs used in the model. However, Gualdron-Duarte et al. [73] found that improvements in predictivity from unweighted GBLUP to BayesR and other methods were similar to improvements from unweighted ssGBLUP to weighted ssGBLUP with external weights. To check how to consider different weights or variance for causative variants in ssGBLUP, see Appendix D.
The default configuration of preGSf90, blupf90, and postGSf90 assume a maximum number of SNPs equal to 400,000. In the presence of sequence data (or over 400,000 SNPs), an extra option is required: OPTION maxsnp x where x is the new maximum number of SNPs. Although the most common way to include pre-selected sequence SNPs is to add them to the current SNP panel, it is possible to have an analysis where both are considered as separate components. In such a case, there is a need to fit two animal effects into the model-one for the current SNP panel and one for the pre-selected sequence SNPs [74]. However, the gains in accuracy using GBLUP with two animal effects was limited [74,75]. There are no reports on the literature about the use of SNPs from a panel and pre-selected from sequence fitting two H in ssGBLUP. Accommodating two G (GBLUP) or two H (ssGBLUP) is possible using BLUPF90 software, although this may have the same impact on accuracy as using different weights for pre-selected sequence SNPs. Additionally, a strong assumption of no correlation between the two random animal effects has to be assumed.
To accommodate two genomic matrices in blupf90, the inverse of those two matrices should be constructed separately using preGSf90 and saved to a file. The preGSf90 has an option to save H −1 but not H, because the latter is never constructed:

OPTION saveHinv
This option saves the diagonals and upper diagonals of H −1 as a plain text file (Hinv.txt) in the format of row, column, and value, where row and column are based on renumbered IDs. When using blupf90, the random type should be set as user_file (see RANDOM_TYPE in Appendix B) and the file name has to be provided (e.g., Hinv_chip.txt, Hinv_seq.txt, . . . ). The random type user_file should be used as many times as the number of different SNP sets were used to compute H −1 (Box 7).

Large-Scale Genomic Evaluations with ssGBLUP
The most expensive operation in ssGBLUP, as implemented in Aguilar et al. [16] and Christensen et al. [17], is the inversion of G and A 22 . This operation has an approximately cubic cost with the number of genotyped animals. With efficient computing algorithms, matrix inversion is feasible for up to 100,000 genotyped animals [76,77]. The number of genotyped animals in some livestock species goes far beyond 100,000 and considerably increases every year. One example is the American Angus Association that has over 780,000 (Steve Miller, 2020; personal communication) and the US dairy industry has already collected over 3.4 M Holstein genotypes (https://queries.uscdcb.com/Genotype/counts.html), where only 11% of those are for males, over 75% are for animals without a BLUP evaluation, and there is a very slow increase in the number of genotypes for proven bulls [78].
To overcome the limitation set by the number of genotyped animals in ssGBLUP, Misztal et al. [79] proposed the algorithm for proven and young (APY) to construct G −1 without having to explicitly invert G. The APY is based on the principles discovered by Henderson [50] and Quaas [51] to recursively construct the inverse of A. The logic behind the construction of G −1 APY is that the genotyped animals are split into core (c) and noncore (n), and the main assumption is that breeding values for noncore animals (u n ) are functions of breeding values of core animals (u c ): where P nc is a matrix that relates breeding values for noncore to core animals, and Ψ n is a diagonal matrix with estimation errors. Following further developments [80], G −1 APY can be constructed as: with m nnn ii = g ii − g ic G −1 cc g ci . The APY algorithm creates a generalized sparse inverse of G at approximately a linear cost in computing and storage [79,80] and has been extensively tested for beef cattle [18], dairy cattle [81,82], and pigs [83,84]. This algorithm enables ssGBLUP evaluations with millions of genotyped animals, as the only inverse needed is for the core animals. Pocrnic et al. [85] and Pocrnic et al. [86] found that the ideal number of core animals depends on the dimensionality of genomic information. Even though millions of animals can be genotyped, the amount of independent genomic information or independent chromosome segments is limited and depends on the effective population size (Ne) and genome length. The knowledge about this non-redundant information enables computations with large-scale genomic data. Pocrnic et al. [86] found that the minimum number of core animals was around 4000 for pigs and chicken, 11,000 for Angus, 12,000 for Jerseys, and 14,000 for Holsteins.
If G −1 APY is efficiently computed but A −1 22 is not, ssGBLUP cannot be used for over 100,000 genotyped animals. To avoid explicit inversion of A 22 , Stranden et al. [87] and Masuda et al. [88] proposed to compute an efficient inverse indirectly as a product of sparse matrices: where A 11 , A 21 , and A 22 are portions of A −1 for non-genotyped, between genotyped and non-genotyped, and for genotyped animals, respectively. Computing time for constructing A −1 22 for 570,000 genotyped animals extracted from a population of 10 M animals was around 11 min [88]. Single-step GBLUP with G −1 APY and efficient A −1 22 was successfully applied to over 10.9 M cows with milking records, 13.5M animals in the pedigree, and about 2.3 M genotyped Holsteins [89]; using 15,000 core animals, the complete evaluation for a model with 18 type traits took four and a half days to converge. Within this time stamp, the construction of G −1 APY and A −1 22 took one day. In fact, this time depends on the total number of genotyped animals and the core group size.
Unfortunately, the subroutines to create G −1 APY and efficient A −1 22 are not implemented in the free distribution of BLUPF90 family of programs.

Unknown Parent Groups (UPG) and Metafounders in ssGBLUP
Commercial populations, especially sheep, beef and dairy cattle, often have incomplete pedigrees. In BLUP, missing parents are modeled by UPG [51,90,91]. Such groups are also known as phantom parents or genetic groups and are used to represent the average level of breeding value in a group where parents were missing. Different groups can be assigned based on year of birth, sex, breed combination, etc. As UPG are mainly modeled as fixed effects, they need to be defined carefully to be estimated accurately and to avoid confounding with other effects in the model [51]. In ssGBLUP, when UPG are applied only to pedigree relationships, the convergence rate can be slow [92]. Misztal et al. [93] revised UPG equations to include groups also in the genomic portion of H −1 , which then becomes H −1 * , based on Quaas-Pollak (QP) transformation [90]: where Q 2 is a matrix that relates genotyped animals to groups; G −1 can be replaced by G −1 APY in large genotyped populations. When UPGs were applied to all components of H −1 , convergence dramatically improved for a multiple-trait model in the Nordic dairy cattle population [94]. Revised UPGs also worked well for the US Holstein data up to 2014 [56]. However, using data updated to 2015, Masuda et al. [95] reported lower prediction reliabilities using revised UPG than not using UPG at all. Therefore, it is not clear whether ssGBLUP equations should include UPG for G −1 , as genomic relationships are not affected by missing pedigree, implying UPG are automatically accounted for. Tsuruta et al. [96] showed that UPG for G −1 were not estimable for young genotyped animals in an 18-type trait genomic evaluation.
Current use of UPG in BLUP ignores the fact they represent sets of related, missing parents in a population under constant selection. Thus, a more accurate modelling would assume missing parents can be related and inbred [97,98]. Legarra et al. [99] proposed the idea of metafounders, which are "inbred and related" UPG. In ssGBLUP, the genomic relationships are usually derived based on current allele frequencies and scaled for compatibility with pedigree relationships as in Vitezica et al. [48]. Based on the metafounders theory, G would be derived using 0.5 allele frequencies as an "absolute reference" [100], and A would be scaled for compatibility with G using covariances among and within metafounders. According to Legarra et al. [99] the covariances represent size of the base population at the time when pedigree recording started and they would be estimated in such a way so that they account for scaling, unaccounted inbreeding, and different genetic level (i.e., when using multibreed or selected populations). Several methods were proposed to estimate the covariances among metafounders, including via gene frequencies related to unknown parents [101]. In simulations and real data, the concept of metafounders delivered the least biased predictions [38,101]. In ssGBLUP, H −1 with metafounders (H Γ−1 ) can be represented by: where Γ is the relationship matrix among metafounders, which can be estimated using generalized least squares [101]. Once Γ is inverted, Henderson [50] rules can be used to construct the inverse of the pedigree relationship matrix. As metafounders are based on UPG, estimating Γ strongly depends on how the groups are assigned. Estimating Γ is an active area of research as many UPGs are weakly related to genotyped individuals.
In the BLUPF90 family of programs, renumf90 can create UPG based on year of birth or can recognize negative values in the pedigree as UPG (see Appendix B). If blupf90 is used to run ssGBLUP, UPG will be set only for A −1 . To set UPG for the full H −1 like demonstrated in (54), the following extra option is needed:

OPTION exact_upg
To set UPG only for A −1 and A −1 22 , a second option is needed to remove UPG for G −1 : OPTION TauOmegaQ2 0.0 1.0 As the metafounders concept is still recent, the BLUPF90 developers are currently working on a standalone software to estimate Γ, which is called gammaf90. After that, blupf90 will be modified to accept an extra type of random effect specific for metafounders. Independent software used in Garcia-Baccino et al. [101] to estimate Γ and to compute H Γ−1 with instructions and examples can be found here https://github.com/alegarra/metafounders. After H Γ−1 is constructed, blupf90 can be used with the random type set as user_file (see RANDOM_TYPE in Appendix B).

Conclusions
The BLUPF90 is a complete software suite for the most common computations needed in animal breeding and genetics. The programs are highly optimized and have been under constant development for over 20 years. Single-step GBLUP, which is one of the main tools for genomic analysis, was first implemented in 2009. Since then, several changes were made to make ssGBLUP flexible to any model, number of traits, number of phenotypes, number of genotyped animals, and sequence data. Single-step GBLUP is fully supported in the BLUPF90 software suite and has been used for genomic evaluations worldwide.  Genotypes were coded based on the number of copies of the alternative allele (0, 1, 2) and are in a file named genotypes.txt, with: animal and SNP_genotype. The last file (gen_map.txt) contains the map for SNPs: SNP identification, chromosome number, position (in base pairs).

Appendix B.
Renumbering the Data with renumf90: The BLUPF90 software suite only works with numeric entries (i.e., integer or real) and levels of all effects need to be consecutive starting from 1. In field datasets, animal ID contains alphanumeric characters and levels of fixed effects are combinations of two or more effects (i.e., contemporary groups). To avoid extra work with sorting and renumbering all effects using independent scripts, renumf90 can be used to renumber the data. This software creates a renumbered phenotype (renf90.dat) and pedigree files (renaddXX.ped; where XX refers to the number of the animal effect in the model), along with a cross-reference table for fixed effects (renf90.tables), a cross-reference file for IDs of genotyped animals (name_of_snp_file_XrefID), and a file with inbreeding coefficients if inbreeding is used to compute A −1 . One interesting feature of renumf90 is that it can trace pedigree back n generations for animals in the data and/or SNP file, removing (pruning) uninformative animals.
The renumf90 requires a parameter file that consists of keywords (capital letters) and the corresponding values. There are 6 mandatory keywords: DATAFILE, TRAITS, FIELDS_PASSED TO OUTPUT, WEIGHT(S), RESIDUAL_VARIANCE, and EFFECT (Table A1). If there is no need to use FIELDS_PASSED TO OUTPUT and WEIGHT(S), simply put an empty line as a value. The EFFECT keyword has several values that are described in Tables A2 and A3: Where position means the column number for the effect being described; effect type is cross for cross-classified effect and cov for covariables. If cross-classified, a data type alpha or number is required to describe variables created based on alphanumeric or only numeric characters, respectively.
Optional keywords also exist and if used, should follow a specific order. For example, if an effect is random, the keyword RANDOM and its value should follow the EFFECT description. Table A3 has the possible optional keywords. A full description can be found in the BLUPF90 manual (http://nce.ads.uga.edu/wiki/lib/exe/fetch.php?media=blupf90_all7.pdf): The following parameter file can be used in renumf90 to renumber the data described in Section 2.4, following the model y = sex + u + residual, which considers sex as fixed and u (i.e., animal effect or direct additive genetic effect) as random (remember that phenotypes.txt contains five columns: animal, sex, phenotype, true breeding value, Hint 3: Save the parameter file above (e.g., parameter1.par). To run renumf90 and save the screen output to a file, use the following command line: renumf90 parameter1.par | tee out.log Hint 4: OPTION map_file is not mandatory. It is used to provide the program the name of the SNP map file. This option will be passed by renumf90 to the new parameter file without being used. The map file should contain a header indicating the columns for the SNP identification (SNP_ID), chromosome (CHR), and the position of the SNP in base pairs (POS).
The renumbered output files and contents are: (1) renf90.dat-is the renumbered phenotype file and contains three columns: phenotype, renumbered sex code, and renumbered animal ID; (2) renadd02.ped-is the renumbered pedigree file and contains 10 columns: (3) renf90.tables-is a file with correspondence table between the original code for fixed effects and the renumbered value. It is organized into three columns: code, number of observations, and renumbered value. (4) renf90.inb-contains the animal original ID and the inbreeding coefficient. (5) genotypes.txt_XrefID-is a cross-reference file with renumbered ID and original ID. This file is created to avoid editing the SNP file, which is usually big and requires a lot of memory. By default, the name of this file is a concatenation of the name of SNP file and the suffix "XrefID", which means cross-reference ID. The parameter file generated by renumf90 is also based on keywords and values that are described in Table A4: Quality Control of Genomic Data with preGSf90: This software is an interface for the genomic library and contains several options (http://nce. ads.uga.edu/wiki/doku.php?id=readme.pregsf90). One useful task is to perform a quality control of the genomic data, removing SNPs with low minor allele frequency (MAF) and monomorphic, SNPs departing from the Hardy-Weinberg Equilibrium, and SNPs with low call rate (i.e., missing in several samples). Checks for animals are also done, which includes removing animals with low call rate (i.e., several SNPs are missing) and animals with Mendelian conflicts (i.e., parentage verification). One option allows preGSf90 to save the clean SNP and SNP map files: OPTION saveCleanSNPs. As preGSf90 also constructs G and A 22 , respective inverses, and G −1 − A −1 22 , some extra options can be used to force the program to perform only quality control, avoiding the creation of the inverses. The options are OPTION createGInverse 0, OPTION createA22Inverse 0, OPTION createGimA22i 0. Therefore, the parameter file to perform only quality control in preGSf90 can be constructed by adding extra options to renf90.par. To run preGSf90 and save the screen output to a file, use the following command line: preGSf90 renf90.par | tee preGSout.log If a SNP map file is provided, three new files are generated by preGSf90: genotypes.txt_clean, geotypes.txt_XrefID_clean, and gen_map.txt_clean, which contain the information after removing SNPs and animals that did not pass the quality control. If preGSf90 is used to do the quality control and save clean files, the parameter file for the subsequent run of blupf90 should include the name of the clean file and an option to avoid running the quality control again:

DATAFILE
OPTION SNP_file genotypes.txt_clean OPTION map_file gen_map.txt_clean OPTION no_quality_control The preGSf90 software can also be used to calculate linkage disequilibrium, to do single value decomposition of the SNP file, to plot the first two principal components for population structure checks, to calculate heritability of gene content, and to save relationship matrices in text or binary formats, including H −1 . As H is not needed in ssGBLUP, this matrix cannot be created using preGSf90.

Appendix D.
Iterative Weighted ssGBLUP with blupf90 and postGSf90: Variance or weights for SNPs can be used to construct the genomic relationship matrix when a diagonal matrix of weights is included in the equation to create G [9], as G = ZDZ 2 p i (1−p i ) . If G can be updated with weights, the fact SNPs explain different proportion of variance can be extended to GEBV. If weights are proper, GEBV accuracy may increase, but this increase depends on data structure. Large genomic data seems to do not benefit from different SNP weighting [60], whereas small genomic data may produce extreme SNP variances.
To iteratively calculate and use weights to update GEBV and SNP effect/variance, it is recommended to use the non-linearA option to avoid extreme SNP variance [26,61]. Assuming the data is renumbered (see Appendix B) and updates are done for GEBV and SNP effect/variance, blupf90 and postGSf90 should be run consecutively until changes in SNP variance or GEBV between the previous and current iteration are small [61]. The parameter file for blupf90 should include extra options to save G −1 (OPTION saveGInverse) and A −1 22 (OPTION saveA22Inverse), to use weights for SNPs (OPTION weightedG weights.txt), and to avoid quality control (OPTION no_quality_control). Avoiding quality control in the iterative run reduces computing time and keeps the number of SNPs constant in consecutive iterations. The file weights.txt contains a column with weights for SNPs, where the number of lines is the number of SNPs. In the first iteration, this file contains a column of ones. The parameter file for blupf90 becomes:  After running postGSf90, the new SNP weight or variance will be the column number seven of snp_sol. Save this column as weights.txt and start a new iteration of blupf90 and postGSf90 until changes in SNP variance or in GEBV are small. If there is a divergence of variances, the first iteration should be used (e.g., no weights). If there is a desire to use external weights, they should be used instead of the column seven of snp_sol. The external weights have to be rescaled to sum to the number of SNPs used in the model.
Sometimes during the iterations, blupf90 outputs a warning "correlation for off-diagonals G and A 22 is lower than 0.5", especially when the default formula to calculate weights is used (i.e., d i = 2p i (1 − p i ) a i 2 ). This is because using weights for G can create some extreme values, lowering the correlation with A 22 . This correlation is expected to be from 0.5 to 0.9, where values greater than 0.9 indicate information in G and A 22 is very similar; therefore, a small gain in accuracy is expected by using genomic information. Low correlation in the first iteration of weighted ssGBLUP may be a sign of misidentified or low-quality genomic samples.