A Study of Genomic Prediction across Generations of Two Korean Pig Populations

Simple Summary: Commercial genotyping has become accessible at a relatively low cost, and it is now widely used by breeders to predict production and economic traits. Many studies have explored the benefits of using DNA information in breeding programs, and many methods have been established to optimize the use of such information. To date, however, very few studies have explored how prediction accuracies change across generations. Here we present a short evaluation across five generations in two pig breeds and evaluate the accuracy of the prediction of relevant production traits using different generational groups.
Abstract: Genomic models that incorporate dense marker information have been widely used for predicting genomic breeding values since they were first introduced, and it is known that the relationship between individuals in the reference population and selection candidates affects the prediction accuracy. When genomic evaluation is performed over generations of the same population, prediction accuracy is expected to decay if the reference population is not updated. Therefore, the reference population must be updated in each generation, but little is known about the optimal way to do it. This study presents an empirical assessment of the prediction accuracy of genomic breeding values of production traits, across five generations in two Korean pig breeds. We verified the decay in prediction accuracy over time when the reference population was not updated. Additionally, we compared the prediction accuracy obtained using only the previous generation as the reference population with that obtained using all previous generations as the reference population. Overall, the results suggested that, although there is a clear need to continuously update the reference population, it may not be necessary to keep all ancestral genotypes. Finally, comprehending how the accuracy of genomic prediction evolves over generations within a population adds relevant information to improve the performance of genomic selection.


Introduction
Genomic models that incorporate dense single nucleotide polymorphism (SNP) marker information are widely used for the prediction of genomic values [1] to select animals and plants in breeding programs [2-5] or to predict susceptibility to diseases in humans [6,7].
Within a genotyped population, the extent of the genomic relationships [8,9], linkage disequilibrium (LD), and co-segregation of quantitative trait loci (QTL) with markers [10] contribute to the accuracy of genomic predictions. In livestock populations such as beef and dairy cattle, sheep, and pigs, the level of relatedness between individuals is higher than that observed in other populations, such as humans.
Genomic relationships are higher within families, rather than across families, and genomic prediction in breeding populations generally relies on the genotypes and phenotypes from ancestral

Materials and Methods
A genetic model assuming additive SNP effects was used for the genomic prediction of five traits (daily weight gain, weight at 90 days of age, average back fat, back fat depth, and meat percentage) in two populations of Korean pigs (Landrace (LL) and Yorkshire (YY), evaluated separately). Animals were genotyped on the Illumina PorcineSNP60 array (http://www.illumina.com). Quality control of the genotypes was performed with the program snpQC [13]. Individuals with call rates <0.9 were excluded from the data set, and only autosomal SNPs were considered for the analyses. SNPs with minor allele frequencies <0.01, with GC (GenCall) scores <0.6, or that failed in more than 5% of the samples were excluded from the genotype data. After quality control, 47,225 SNPs remained for a total of 8202 individuals (3149 LL and 5053 YY). Missing SNP genotypes (<0.01%) were imputed with Eagle [14] using the software's default parameters. Seven discrete generations (generation zero to generation six) were available in the populations, but generations four to six were combined into a single generation (4-6) because of the small number of individuals in generations five and six. Table 1 summarizes the number of individuals by breed per generation.
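The call-rate, missingness, and allele-frequency filters described above can be sketched as follows. This is only an illustration in plain Python, not the snpQC program used in the study. The function name `qc_filter`, the 0/1/2 genotype coding with `None` for a missing call, the order of the two steps, and the toy data are all assumptions made for the example. The GC-score filter is left out because it needs the raw array output.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1 or 2 copies of an allele; None = missing call


def qc_filter(
    genotypes: List[List[Genotype]],
    min_call_rate: float = 0.9,     # individuals below this call rate are dropped
    min_maf: float = 0.01,          # SNPs below this minor allele frequency are dropped
    max_snp_missing: float = 0.05,  # SNPs failing in more than 5% of samples are dropped
) -> Tuple[List[int], List[int]]:
    """Return (kept individual indices, kept SNP indices)."""
    n_snps = len(genotypes[0])

    # Step 1: individual call rate = share of non-missing genotypes.
    kept_ind = [
        i for i, row in enumerate(genotypes)
        if sum(g is not None for g in row) / n_snps >= min_call_rate
    ]

    # Step 2: per-SNP missing rate and minor allele frequency, computed on
    # the individuals that passed step 1 (an assumed order of operations).
    kept_snp = []
    for j in range(n_snps):
        calls = [genotypes[i][j] for i in kept_ind]
        observed = [g for g in calls if g is not None]
        if not observed:
            continue
        missing_rate = 1 - len(observed) / len(calls)
        freq = sum(observed) / (2 * len(observed))
        maf = min(freq, 1 - freq)
        if missing_rate <= max_snp_missing and maf >= min_maf:
            kept_snp.append(j)
    return kept_ind, kept_snp


# Toy data: 4 individuals x 4 SNPs (hypothetical values).
toy = [
    [0, 1, 2, 0],
    [1, 1, 2, 0],
    [2, 0, 2, 0],
    [None, None, 2, 0],  # call rate 0.5, so this individual is removed
]
kept_ind, kept_snp = qc_filter(toy)
print(kept_ind, kept_snp)
```

In the toy data the last two SNPs are monomorphic among the retained individuals, so they fail the allele-frequency filter.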
The phenotypes were pre-corrected for fixed effects (overall mean, sex, company, and parity) using a linear regression model, $y_{\text{original}} = Xb + e$, in which $b$ is the vector of coefficients for the fixed effects and $X$ is their design matrix, and the pre-corrected phenotypes were defined as $y = y_{\text{original}} - Xb$. Then, as previously stated, we assumed an additive genomic model $y = Ma + \varepsilon$ for the prediction of breeding values, in which $y$ are the pre-corrected phenotypes, $M$ the SNP genotypes, $a \sim N(0, I\sigma^2_a)$ the additive random SNP effects, and $\varepsilon \sim N(0, I\sigma^2_\varepsilon)$ the random residuals. The breeding values were defined as $g = Ma$ and the genomic relationship matrix (GRM) as $G = MM' / \sum_{j=1}^{m} 2p_jq_j$ [15], with $g \sim N(0, G\sigma^2_g)$, where the $p_j$ are the minor allele frequencies of the SNP genotypes ($q_j = 1 - p_j$) and $\sigma^2_g = \sum_{j=1}^{m} 2p_jq_j\sigma^2_a$. Heritabilities ($h^2 = \sigma^2_g / (\sigma^2_g + \sigma^2_\varepsilon)$) for the entire dataset and for each generation group (Table 2) were estimated using restricted maximum likelihood (REML) [16], and breeding values were estimated using GBLUP [15]. All analyses were coded and conducted with the R statistical programming language.
Along with the genomic evaluations across generations, we evaluated the genomic relationships between reference and test populations across these generations. For this evaluation, we calculated the mean values of the GRM between the reference and the test individuals ($\mathrm{mean}(G_{ref,test})$), and we also calculated a genomic correlation between reference and test populations with a singular value decomposition (SVD) of the genotype matrix [17]. For this genomic correlation, we assumed $\mathrm{SVD}(M) = UDV'$ and $R = (n_{test}/n_{ref})\, V'_{test} V_{ref} D_{ref}$, with $\mathrm{SVD}(R) = U_R T V'_R$. $R$ is a matrix that correlates the reference and the test individuals through their genotypes and accounts for their variance components and differences in population sizes. The prediction accuracy reaches its maximum ($\sqrt{h^2}$) when $R$ and $D_{test}$ are similar matrices, and this similarity can be summarized as $r = a + b$, where $a$ and $b$ are the coefficients of the regression $T \sim D_{test}$. Because $R$ is balanced for different population sizes, it is expected that $-1 < a < 0$ and $0 < b < 1$. Thus, $r = a + b$ is the correlation between reference and test populations using the SVDs of $M_{ref}$ and $M_{test}$ [17].
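To make the relationship matrix and the mean reference-test relationship concrete, the sketch below builds a GRM of the form $G = MM'/\sum_j 2p_jq_j$ and averages its reference-by-test block. It is written in plain Python for readability, whereas the study used R, and it is not the authors' code. It assumes genotypes coded 0/1/2 and centered by twice the allele frequency before the cross-product, which is a common construction but is not spelled out in the text. The function names `grm` and `mean_relationship` and the toy genotypes are hypothetical.

```python
from typing import List, Sequence


def grm(genotypes: Sequence[Sequence[int]]) -> List[List[float]]:
    """Genomic relationship matrix from an individuals x SNPs 0/1/2 matrix."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the counted allele at each SNP; 2pq is the same whether
    # p refers to the minor or the major allele.
    p = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    scale = sum(2 * pj * (1 - pj) for pj in p)
    if scale == 0:
        raise ValueError("all SNPs are monomorphic; the GRM is undefined")
    # Assumption: center each SNP by 2p before forming the cross-product.
    centered = [[row[j] - 2 * p[j] for j in range(m)] for row in genotypes]
    return [
        [sum(a * b for a, b in zip(ci, ck)) / scale for ck in centered]
        for ci in centered
    ]


def mean_relationship(
    G: Sequence[Sequence[float]], ref: Sequence[int], test: Sequence[int]
) -> float:
    """Mean of the GRM block linking reference and test individuals."""
    return sum(G[i][k] for i in ref for k in test) / (len(ref) * len(test))


# Toy genotypes (hypothetical): individuals 0-1 form the reference,
# individuals 2-3 form the test set.
M = [
    [0, 1, 2],
    [0, 1, 2],
    [2, 1, 0],
    [2, 1, 0],
]
G = grm(M)
print(mean_relationship(G, ref=[0, 1], test=[2, 3]))
```

With real data, `ref` and `test` would hold the row indices of the animals in the reference and test generations, so the same function gives one value per scenario and test generation.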
Prediction accuracy and the relationships between reference and test populations were obtained when the reference population was not updated (maintaining generation zero as the reference for all subsequent generations), as well as when individuals from only the immediately previous generation were used as the reference population. A cumulative reference population with all individuals from previous generations was used as the baseline (best-case scenario). For each scenario, all animals available within these groupings were included in the analyses; there was no data sub-setting or resampling. Prediction accuracy trends across generations were then compared to the relationships between reference and test populations across generations.
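The three reference scenarios amount to a rule that maps a test generation to the generations used for training. The sketch below shows that rule in plain Python. The scenario labels (`"fixed"`, `"previous"`, `"cumulative"`), the function names, and the animal identifiers are assumptions made for illustration, and generation 4 stands for the combined group 4-6.

```python
from typing import Dict, List, Tuple


def reference_generations(test_gen: int, scenario: str) -> List[int]:
    """Generations that form the reference population for a test generation."""
    if test_gen < 1:
        raise ValueError("the test generation needs at least one earlier generation")
    if scenario == "fixed":        # generation zero only, never updated
        return [0]
    if scenario == "previous":     # only the immediately previous generation
        return [test_gen - 1]
    if scenario == "cumulative":   # all previous generations (baseline)
        return list(range(test_gen))
    raise ValueError(f"unknown scenario: {scenario!r}")


def split_by_scenario(
    generation_of: Dict[str, int], test_gen: int, scenario: str
) -> Tuple[List[str], List[str]]:
    """Return (reference animal ids, test animal ids); no sub-setting or resampling."""
    ref_gens = set(reference_generations(test_gen, scenario))
    ref = sorted(a for a, g in generation_of.items() if g in ref_gens)
    test = sorted(a for a, g in generation_of.items() if g == test_gen)
    return ref, test


# Hypothetical animals labeled by generation.
generation_of = {"a0": 0, "b0": 0, "a1": 1, "a2": 2, "a3": 3}
for scenario in ("fixed", "previous", "cumulative"):
    print(scenario, split_by_scenario(generation_of, test_gen=3, scenario=scenario))
```

For test generation one, all three rules return generation zero, so the scenarios only start to differ from generation two onward.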

Results
We performed a brief analysis of the two pig populations (breeds LL and YY) and their generations, regarding the principal components of the GRM and the phenotypic values. Figure 1 presents the first two principal components, which represent 27% and 3.2% of the total variance, respectively, by breed and generation. With the principal components analysis (PCA), we can observe a clear separation between the two breeds. Within breeds, the distinction between generations is less clear-cut, although a slight trend is discernible. When the PCA was performed on each breed independently, the across-generation variance for the sum of the first two principal components of the GRM was 5.6% for LL and 9.2% for YY.

Table 2 presents the heritabilities obtained with REML by breed and generation for the five traits (daily weight gain, weight at 90 days, average back fat, back fat depth, and meat percentage). The standard deviation of heritability estimates ranged from 0.025 to 0.048.
Figure 3 presents the prediction accuracies obtained by breed and generation for the five production traits. We observed distinct levels of prediction accuracy when different generations were considered as the reference population. When generation zero was used as the reference for all subsequent generations, we observed a loss in prediction accuracy in the later generations for all traits, in both breeds. When the immediately previous generation or all the previous generations were used as reference populations, we observed an increase in prediction accuracy for all traits, in both breeds.
Figure 4 presents the relationships between individuals in reference and test populations by breed and generation. We observed distinct patterns depending on which generations were considered as the reference population. When generation zero was fixed as the reference for all subsequent generations, we observed a decay in the relationships to further generations for all the traits, in both breeds. When the immediately previous generation was used as the reference population, we observed that $\mathrm{mean}(G_{ref,test})$ remained stable across generations for both breeds, while $r = a + b$ increased for YY and remained stable for LL. When all previous generations were used as the reference population, we observed that $\mathrm{mean}(G_{ref,test})$ remained stable across generations for LL and increased for YY, while $r = a + b$ increased for both LL and YY (with the exception of generation 4-6, for which there was a decrease of $r = a + b$ for YY).
Figure 3. Prediction accuracies ($\widehat{\mathrm{cor}}(y_{test}, g_{test})$, the correlation between real phenotypes in the test population, $y_{test}$, and predicted breeding values, $g_{test}$) by breed (Landrace (LL) and Yorkshire (YY)) and generation (one to four, the last one combining generations four to six). Results are presented for the different generations considered as the reference population to perform genomic prediction of a test generation ($t$).
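Prediction accuracy in this study is the correlation between the phenotypes of the test population and their predicted breeding values. As a minimal sketch, it can be computed as a Pearson correlation in plain Python. The function name `prediction_accuracy` and the numbers are made up for illustration and are not results from the study.

```python
from math import sqrt
from typing import Sequence


def prediction_accuracy(y_test: Sequence[float], g_test: Sequence[float]) -> float:
    """Pearson correlation between test phenotypes and predicted breeding values."""
    n = len(y_test)
    if n != len(g_test) or n < 2:
        raise ValueError("need two equally long vectors with at least two values")
    mean_y = sum(y_test) / n
    mean_g = sum(g_test) / n
    cov = sum((y - mean_y) * (g - mean_g) for y, g in zip(y_test, g_test))
    var_y = sum((y - mean_y) ** 2 for y in y_test)
    var_g = sum((g - mean_g) ** 2 for g in g_test)
    if var_y == 0 or var_g == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_y * var_g)


# Hypothetical pre-corrected phenotypes and predicted breeding values.
y_example = [0.8, -0.3, 1.2, -1.0, 0.1]
g_example = [0.5, -0.1, 0.7, -0.6, -0.2]
print(round(prediction_accuracy(y_example, g_example), 3))
```

Applied once per trait, breed, reference scenario, and test generation, this gives one point in a plot such as Figure 3.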

Discussion
Our study aimed to evaluate the accuracy of prediction across generations, and possibly relate trends in prediction accuracy to patterns in the relationships between individuals from reference and test populations. We observed that the closer the reference and test populations are in terms of generations, the higher both the relationships and the prediction accuracy.
While the close relationship between test and reference populations is highly relevant to prediction accuracy, it is also well known that sample size and genetic and phenotypic variation are important contributors to genomic prediction. As expected, aggregating all past generations into the reference population gave a higher prediction accuracy than using only the immediately previous generation. This design was intentional: we primarily wanted to contrast the decay in accuracy across generations, and the aggregated data from all previous generations set a baseline of "best possible prediction" against which closer and more distant reference generations could be compared.
This same behavior in prediction accuracy was also observed in the relationships between individuals, and was even clearer in our measure based on the SVD of the genotype matrices. The gain in prediction accuracy from aggregating all past generations is, however, greater than the gain in the relationship between individuals. This is most likely a result of an additional gain in prediction accuracy due to the larger size of the reference population.
We observed some fluctuations in the prediction accuracies, and these are most probably related to sample sizes. The number of individuals available for each generation in our data set was not homogeneous, and this may be a cause of the observed fluctuations. In addition, the scenario that incorporated all previous generations in the reference population represents the best-case scenario, serving as a comparative baseline for the prediction accuracies. An interesting result was that for LL, with the exception of average back fat, the prediction accuracy of generation 4-6 was very similar whether all previous generations were incorporated in the reference population or only the immediately previous generation was used as the reference. The same was observed for YY, for daily weight gain and weight at 90 days. This indicates that, if the population evolves mostly within itself, ancestral generations can potentially be discarded from the reference over time without compromising prediction accuracy. The sample sizes of the LL breed were quite small compared to those of the YY breed, and the phenotypic values of LL fluctuated more because of the small sample size, potentially resulting in less reliable prediction accuracies.
Cross-generational persistency of prediction accuracy should be further explored in animal breeding. One topic that should be addressed in future studies is how many generations back a reference population should reach in order to predict the subsequent generation. We observed that when all previous generations are aggregated to train the model, both the relationships between individuals and the prediction accuracies start to approach a maximum. We did not, however, have enough generations in our data set to explore this in detail. This question is particularly interesting because it can indicate how much information needs to be aggregated into reference populations without adding too many individuals from older generations, which contribute no relevant information and increase the computational burden of fitting the prediction models. Another topic that should be explored in these studies is the behavior of prediction accuracy in relation to the intensity of selection and the incorporation of new individuals into a herd.
While trends were observed in the prediction accuracy of the traits evaluated over generations (increasing accuracy of prediction when the reference population was updated, and decreasing accuracy of prediction when the reference population was not updated), no clear or conclusive trends were observed in the estimated heritabilities. This is potentially a topic that would generate more discussion when combined with evaluations that explore intensity of selection and the incorporation of new individuals into a herd.
Finally, comprehending how the accuracy of genomic prediction evolves over generations within a population may add extra and relevant information to improve the performance of genomic selection. Overall, these results confirm the need to continuously update the reference with new genotypes and phenotypes, but they also show that it may not be necessary to keep all ancestral genotypes indefinitely in the reference population.

Conclusions
This study presented an empirical assessment of the prediction accuracy of genomic breeding values of production traits, across five generations in two Korean pig breeds. Overall, the results suggested that, although there is a clear need to continuously update the reference population, it may not be necessary to keep all ancestral genotypes. Finally, comprehending how the accuracy of genomic prediction evolves over generations within a population may add extra and relevant information to improve the performance of genomic selection.
Author Contributions: B.C.D.C. and C.G. conceptualized the study; D.S. generated the data; C.G., D.S., and H.W. prepared the data for the analyses; B.C.D.C. performed the analyses and wrote the manuscript.
Funding: This work was supported by a grant from the Next-Generation BioGreen 21 Program (Project No. PJ01322204 and PJ01337702), Rural Development Administration, Korea.

Conflicts of Interest: The authors declare no conflict of interest.