Genomic Analysis Using Bayesian Methods under Different Genotyping Platforms in Korean Duroc Pigs

Simple Summary This study investigated the informative regions and the efficiency of genomic predictions for backfat thickness, days to 90 kg body weight, loin muscle area, and lean percentage in Korean Duroc pigs. The several regions of the genome were identified and a significant marker was found near the MC4R gene for growth and production-related traits. No differences in genomic accuracy were identified on the basis of the Bayesian approaches in these four growth and production-related traits. The genomic accuracy is improved by using deregressed estimated breeding values including parental information as a response variable in Korean Duroc pigs. Abstract Genomic evaluation has been widely applied to several species using commercial single nucleotide polymorphism (SNP) genotyping platforms. This study investigated the informative genomic regions and the efficiency of genomic prediction by using two Bayesian approaches (BayesB and BayesC) under two moderate-density SNP genotyping panels in Korean Duroc pigs. Growth and production records of 1026 individuals were genotyped using two medium-density, SNP genotyping platforms: Illumina60K and GeneSeek80K. These platforms consisted of 61,565 and 68,528 SNP markers, respectively. The deregressed estimated breeding values (DEBVs) derived from estimated breeding values (EBVs) and their reliabilities were taken as response variables. Two Bayesian approaches were implemented to perform the genome-wide association study (GWAS) and genomic prediction. Multiple significant regions for days to 90 kg (DAYS), lean muscle area (LMA), and lean percent (PCL) were detected. The most significant SNP marker, located near the MC4R gene, was detected using GeneSeek80K. Accuracy of genomic predictions was higher using the GeneSeek80K SNP panel for DAYS (Δ2%) and LMA (Δ2–3%) with two response variables, with no gains in accuracy by the Bayesian approaches in four growth and production-related traits. Genomic prediction is best derived from DEBVs including parental information as a response variable between two DEBVs regardless of the genotyping platform and the Bayesian method for genomic prediction accuracy in Korean Duroc pig breeding.


Introduction
Genomic selection (GS) has been widely applied to several species, for example, pigs, chickens, beef, and dairy cattle, using commercial single nucleotide polymorphism (SNP) genotyping platforms from Illumina, GeneSeek-Neogen, and Affymetrix. These arrays estimate genomic-enhanced estimated breeding values (GE-EBVs), which are blended with classical estimated breeding values (EBVs) from classical genomic best linear unbiased prediction (BLUP) and molecular breeding values (MBVs) from summation of single nucleotide polymorphism (SNP) marker effects for genotyped animals. The most important parameter in genomic prediction modeling is the accuracy of genomic prediction for the estimation of GE-EBVs because the weights are determined on the basis of that parameter when blended with traditional EBVs and MBVs in a "correlated traits" approach [1]. Two terms in the classical concept of quantitative genetics in the formula of genetic progress (∆G = i × r × σ g L ) that are directly affected by the implementation of genomic selection in the pig industry are the generation interval (L) and accuracy (r).
Genetic improvements are achieved by reducing the generation interval and increasing the accuracy through genomic selection modeling in dairy and beef cattle. However, in pigs, genetic improvements with the generational interval parameter are limited by rapid generational turnover. Therefore, increased accuracy of genetic predictions may be the largest parameter impacting genomic selection in pig breeding [2]. These authors [2] also reviewed the accuracy of genomic prediction for maternal, performance, and carcass traits in pigs. Breeders of various animal species have conducted research using various prediction models with the aim of increasing the accuracy of genomic prediction for more reliable GE-EBVs. However, these models are affected by several factors, such as the size of the reference data, which mean having both genomic and phenotypic data [3,4], density of genotyping platforms [5][6][7][8], relationships between training and testing sets in the process of cross validation for genomic accuracy [9], and choice of response variables in genomic prediction models [8]. In addition, the models were also affected by the choice of covariates (individual SNP vs. haplotype) in genomic prediction modeling [10], choice of penalty or a priori density in statistical methods (e.g., regression on SNP marker) [11], and causative variants or SNPs in strong linkage disequilibrium (LD) with causative variants [12].
The objectives of this study were to (1) identify informative genomic regions through a genome-wide association study (GWAS) and (2) to investigate and compare the accuracy of genomic prediction of two genomic evaluation methods using two Bayesian methods (BayesB and BayesC) under two medium-density, SNP genotyping platforms with two response variables (DEBVexcPA and DEBVincPA). This is the first study to assess the accuracy of genomic prediction for growth and production-related traits in Korean Duroc pig populations.

Genotype and Phenotype Data Editing and Imputation
A total of 1026 Duroc pigs were genotyped. These animals were genotyped with two medium-density SNP genotyping platforms, including 487 genotyped by the Illumina PorcineSNP60 version 2 (Illumina, Inc., San Diego, CA, USA) and 539 genotyped by GeneSeek-Neogen PorcineSNP80 (BeadChip Neogen Agrigenomics, Lincoln, NE, USA), respectively. They consisted of 61,565 and 68,528 SNP markers, respectively. The quality control measures in SNP markers and animals were performed by excluding 7849 and 7758 unmapped SNPs, 1458 and 3273 SNPs on sex chromosomes, 6399 and 1613 SNPs with a poor call rate (<0.90), 17 and 18 SNPs with a poor call rate for a duplicate SNP map-position for SNP markers, 7 and 5 animals with poor call rates (<0.90), 8 and 34 animals that did not match with phenotypes for each Illumina PorcineSNP60 version 2 and GeneSeek-Neogen PorcineSNP80 genotyping platform, and 3 animals with a poor call rate for duplicate genotypes between two SNP genotyping platforms. Consequently, the number of available SNP markers was 45,840 and 55,866 for 60K and 80K SNP panels, respectively, leaving 472 and 500 animals for 60K and 80K SNP panels, respectively, for use in further genome-wide association studies, and genomic prediction modeling.
The imputation processes of these two medium density SNP panels (Illumina60K and GeneSeek80K) were separately performed using the following two steps: (1) FImpute version 2.2 [13] imputed missing SNP genotypes of two SNP panels (50K and 80K) on the basis of the information for each marker map to be used for the reference panel and (2) FImpute version 2.2 [13] imputed between two SNP panels, from Illumina60K to GeneSeek80K and from GeneSeek80K to Illumina60K. Finally, we accepted two kinds of reference population for further genomic analysis, consisting of 972 animals in the imputed 60K and 80K data because there were no duplicate genotypes between the Illumina60K and GeneSeek80K.

Deregression of Expected Breeding Values (DEBVs) for Response Variables
A multitrait animal model with 46,305 phenotypic data recorded from 2005 to 2017 and 72,781 pedigree records was applied to estimate the variance components and genetic parameters (Table 1). This was required as a priori information for the genomic prediction model and for EBVs and corresponding reliabilities for the genotyped animals and their sires and dams. These analyses used the ASReml version 4.1 software [14] for four growth and production-related traits: backfat thickness (BFAT), days to 90 kg body weight (DAYS), loin muscle area (LMA), and lean percent (PCL). Phenotypes were adjusted for fixed effects using contemporary groups comprising of farm, birth-year, season, and sex effects. A common litter environment effect was also included in a multitrait animal model for those parameters and EBV. We used the methodology provided by Garrick et al. [15] for the two kinds of DEBVs, which were (1) a combination of deregression (dividing by the reliability of the EBV) and adjustment for ancestral information (i.e., parental average, which only contained their own and the descendant's information, hereafter called "DEBVexcPA"), and (2) in contrast to Garrick et al. [15], the parent average EBV (PA) was added back to the DEBV (hereafter called "DEBVincPA") to account for breed and family differences in subsequent analyses. These two DEBVs (DEBVexcPA and DEBVincPA) were obtained using Equation (1): with the corresponding weighting factors using Equation (2): whereĝ i is the EBV (estimated breeding value) of the individual, PA is its parent average, h 2 is the heritability, r 2 i is the reliability of the EBV of the individual, and c is the proportion of genetic variation that could not be explained by the markers. In this study, c was assumed to be equal to 0.40 and is the proportion of the genetic variance not explained by SNP markers, as suggested by Saatchi et al. [16].
After removing animals with a reliability of less than 0.10, 964 registered Duroc pigs remained for further analysis.

Statistical Method for Estimating SNP Effects
Two methods (BayesB [3] and BayesC [17], with π set to 0.99 and weighting factors), were used to estimate SNP marker effects using the GenSel4R software [18] for GWAS and genomic prediction models. The BayesB and BayesC methods use the mixture model that assumes some fraction π of SNP markers have zero effects and assumes that SNP markers have non-zero effects. The BayesB method uses the t-distribution a priori for the SNP marker effects and has locus-specific variances whereas the BayesC method uses the normal distribution a priori for the SNP marker effect and has a common variance [19]. For each trait, the model was fitted to estimate SNP marker effects for these two methods using Equation (3): where y i is response variable (DEBVexcPA or DEBVincPA) on animal i for the respective trait, I? is the population mean, k is the number of markers, Z ij is allelic state at locus j in individual i, and u j is the random substitution effect for marker j, which follows a mixture distribution for this random substitution effect according to indicator variable (δ j ). A random absent (0) or present (1) variable indicates the absence or presence of marker j in the model, with u j assumed normally distributed N(0, σ 2 u ) when δ j = 1, and e i is a random residual effect assumed normally distributed N(0, σ 2 e ). The posterior distributions of the parameters and effects were obtained using Gibbs sampling, for a total number of 110,000 Markov chain Monte Carlo (MCMC) iterations, the first 10,000 of which were discarded for burn-in, before estimating posterior means of marker effects and variances, and a sampling interval (thinning) of 10. All procedures were implemented in GenSel4R software [18]. The convergence of MCMC iterations was tested by comparing results from three iteration lengths (75,000 vs. 110,000 vs. 150,000) with the first 10,000 cycles being discarded and having a sampling interval of 10. The differences in the posterior means of genetic and residual variances were negligible among three MCMC iteration lengths for all growth and productive-related traits (results not shown).

Identification of Significant Window Regions and SNP Markers
The 0.8% of additive genetic variance, which was estimated as a fraction of the total genetic variance explained by all SNPs, was used for the significance level of the putative informative 1 Mb window region. A total of 2454 1-Mb window regions located on autosomes were considered for two SNP genotyping platforms (Illumina60K and GeneSeek80K) in this analysis. The theoretical rate of the genetic variance could be assumed approximately 0.04% (100% /2454), but the stringent threshold of 0.8%, which is twenty times higher than the theoretical proportion was considered as the small reference set in Korean Duroc pigs. The Bayes factor (BF) was used to determine SNPs with a significant association within this region using Equation (4): where π is the prior probability andp i is the posterior probability that an SNP was included in the model. Following the definitions of Kass and Raftery [20] for the strength of an association on the basis of the range of values, the SNP markers with a Bayes factor above 3.2 were considered as "suggestive evidence", above 20 was described as "strong evidence" and above 100 was described as "decisive evidence".

Accuracy of Genomic Prediction under a 10-Fold Cross-Validation
To account for the relatively small sample size of the prediction model, a 10-fold cross-validation strategy was used to estimate the accuracies of the genomic prediction models. Previous study related to the number of folds on the process of the cross validation have reported that trade-off effects were detected between the number of folds and the relationships between training and testing sets [8]. Nevertheless, we used a 10-fold cross-validation to maximize the size of the training data because of the limited reference data set in Korean Duroc pigs. For each trait of interest in this study (BFAT, DAYS, LMA, and PCL) and following the procedures outlined by Saatchi et al. [9], genotyped animals were split into ten groups using K-means clustering to reduce the relationships between training and testing populations. A total of 3821 elements of pedigree information related to the 964 genotyped Duroc pigs was used for K-means clustering, giving the number of individuals within each fold, and within and between fold averages of a max and a ij , and their standard deviations ( Table 2). Accuracies of genomic prediction were assessed by the correlation between the MBVs of genotyped animals from each validation set and their response variables, r(ŷ, y), where y is a vector of pseudo-phenotypes (DEBVexcPA or DEBVincPA) for the validation set andŷ is a vector of MBV for the corresponding animals in y.

Assessing the Accuracy of Imputation
The imputation process was performed to test the imputation accuracy of two SNP genotyping platforms, the Illumina60K and GeneSeek80K. The accuracy of imputation with a higher minor allele frequency (MAF) was lower than for those with a lower MAF for both SNP genotyping platforms ( Figure 1). These results are consistent with the results of Badke et al. [21], who showed that the proportion of correctly imputed alleles decreased by increasing the number of SNPs with a high MAF in Yorkshire pigs. Using dairy cattle, Ma et al. [22] showed that the imputation accuracies were lower with a higher MAF across available imputation programs [13,[23][24][25][26]. The accuracies of imputation using simulation studies were 98.6% from the Illumina60K to GeneSeek80K SNP panel and 99.4% from the GeneSeek80K to Illumina60K SNP panel. The accuracies of imputation were similar and consistent across chromosomes for imputation to both SNP platforms from the other SNP platform, likely because the proportion of common SNP markers between the two SNP genotyping platforms (Illumina60K: 59.1% and GeneSeek80K: 53.1%) was high.

Genome-Wide Association Study (GWAS) for Growth-and Production-Related Traits
GWAS for growth and production traits was performed using two commercially developed Porcine SNP genotyping platforms (Illumina60K and GeneSeek80K) to identify the most informative window regions and significant SNP markers based on the Bayes factor within these regions. GWAS analyses using BayesB with a high value of π (0.99) and a DEBVincPA response variable for growth and production-related traits in Duroc pigs were chosen because the informative window region and significant SNP markers were similarly distributed across the three response variables and Bayesian methods. The results of these associations are shown in Tables 3 and 4, and Figure 2. Table 3. Informative 1 Mb genome windows and significant single nucleotide polymorphisms (SNPs) based on the Bayes factor within windows associated with growth-and production-related traits in Korean Duroc pigs from the genome-wide association study (GWAS) using markers on the Illumina PorcineSNP60 genotyping platform.   Three and four informative windows (1 Mb) were detected for BFAT using the Illumina60K panel and GeneSeek80K panel, respectively. The most significant window was identified on Sus scrofa chromosome (SSC)1 at 62 Mb using the Illumina60K panel and on SSC1 at 178 Mb using the GeneSeek80K panel, which explained 1.26% and 1.88% of genetic variance, respectively. Significant SNP markers, based on the Bayes factor, common to both two panels were ALGA0003581 and ALGA0003587, which were located on SSC1 at the 62 Mb position nearby the CGA gene. For DAYS, we detected five informative quantitative trait loci (QTL) using the Illumina60K panel and three informative QTLs using the GeneSeek80K panel. The regions of SSC7 at 124 Mb (1.58%) and SSC18 at 29 Mb (1.19%) were the most informative 1 Mb window regions in GWAS for DAYS using the Illumina60K and using GeneSeek80K, respectively. The common significant SNPs were ALGA0097693 (located on SSC18 at the 29 Mb position between TSPAN12 and CFTR) and ASGA0004988 (located on SSC1 at the 177 Mb position between RNF152 and MC4R) in both panels. We identified six significant regions using the Illumina60K panel and five significant regions using the GeneSeek80K panel for LMA. The most significant region using the Illumina60K was detected on SSC5 at 87 Mb (2.36%), and SSC1 at the 178Mb (3.56%) region was detected when using GeneSeek80K. Two informative QTLs were detected using the Illumina60K and GeneSeek80K panels for PCL. In addition, common significant SNPs were not identified using either panel. The GeneSeek80K genotyping panel contained more SNP markers in major genes than the Illumina60K genotyping panel (i.e., MC4R). As a result, the informative SNP markers not included in the Illumina60K panel were detected using the GeneSeek80K panel. Interestingly, the WU_10.2_1_178188861 SNP located by the GeneSeek80K panel was associated with all growth and production traits except DAYS, which is not included in the Illumina60K panel and is located on SSC1 at the 178 Mb position between RNF152 and MC4R. For DAYS, ASGA0004988, which was positioned on SSC1 at 177 Mb, was detected as an informative SNP marker, but the nearest gene was MC4R. The MC4R gene is a major determinant of the nervous system and plays a substantial role in the regulation of food intake, energy balance, and body weight in mammals [27][28][29]. Previous studies [29][30][31][32] have reported the identified QTL near the MC4R gene located at 178 Mb on SSC1 as Sscrofa10.2. Our findings related to the 178 Mb region on SSC1, along with other significant regions, were consistent with previously identified regions that potentially impact growth and production traits in the Animal QTL database.  Table 2 shows that the data were successfully partitioned using the K-means clustering method for genomic evaluation, whereby the relatedness was maximized within each partitioned group and minimized between each partitioned group. The accuracy of the genomic prediction with BayesB using the Illumina60K ranged from 0.179 (BFAT) to 0.234 (LMA) and from 0.247 (BFAT) to 0.314 (LMA) for DEBVexcPA and DEBVincPA, respectively (Table 5). A similar trend was observed with BayesC when using the GeneSeek80K, with a range of accuracies from 0.176 (BFAT) to 0.246 (LMA) and from 0.250 (BFAT) to 0.331 (LMA) for DEBVexcPA and DEBVincPA, respectively (Table 5). These results indicate similar levels of accuracy of genomic prediction regardless of the genotype platform or Bayesian method. However, a slight increase in the accuracy of genomic prediction was observed in DAYS (2%) with the DEBVexcPA response variable and LMA (2% and 3%) with the DEBVexcPA and DEBVincPA response variables when comparing the GeneSeek80K to the Illumina60K SNP genotyping platform. These comparisons between different SNP genotyping platforms have also been studied in beef and dairy cattle [7,33]. In cattle, the accuracies of genomic prediction were compared between moderate and high-density panels (50K and 777K) or between moderate-density genotype panels (50K and 80K). Pérez-Enciso et al. [7] observed that the reliabilities of genomic predictions did not increase when using a high-density SNP chip (HD) compared with a 50K SNP chip. Lee et al. [8] and Guo et al. [33] also reported no significant improvement in accuracy when using a 50K panel vs an 80K panel for Red Angus beef cattle in the United States. Overall, no significant improvements in prediction accuracies on the basis of SNP panel density have been observed from the results of previous genomic prediction studies (50K vs. 777K or 50K vs. 80K) because even though the number of SNPs increases, the panel may contain a small number of SNP markers in high LD with causative variants. In addition, simply increasing SNP markers instead of causal mutation may bring an additional source of noise to genomic prediction [5,7,8]. The results revealed a slight increase in genomic accuracy for DAYS and LMA with the BayesB method and the GeneSeek80K SNP platform compared with the Illumina60K SNP platform. This was because the GeneSeek80K SNP platform includes a causal variant in strong LD with the MC4R gene [2], which was the most informative for LMA (Table 4, Figure 2). These results are consistent with the results of Pérez-Enciso et al. [7] and Van et al. [12], suggesting that the inclusion of causative variants or SNPs with high LD with causative mutation improved the accuracy of genomic prediction.

Response Variables (DEBVincPA and DEBVexcPA)
The average accuracies of genomic prediction across the growth-and production-related traits ranged from 0.177 (BFAT) to 0.244 (LMA) for the Illumina60K and GeneSeek80K when using DEBVexcPA as a response variable, and from 0.252 (BFAT) to 0.327 (LMA) for the Illumina60K and GeneSeek80K when using DEBVincPA as a response variable with 10-fold cross validation (Table 5). In the current study, we observed higher prediction accuracies when using DEBVincPA as a response variable compared with DEBVexcPA for all studied traits. Interestingly, the largest difference (+8.3%) in terms of average accuracies of genomic prediction between the two response variables was observed for the lowest heritable trait (LMA) when using DEBVincPA as a response variable. While DEBVexcPA [15] has the greatest numerical properties in addressing double counting by removing the parental contribution [8,33,34], our results showed a lower performance in prediction accuracies in comparisons of two response variables. Boddhireddy et al. [34] reported that using EBV without removing parental contributions as a response variable yielded greater prediction accuracies compared to using DEBVexcPA in both validation tests for US Angus beef cattle. Lee et al. [8], however, observed that the genomic accuracies obtained using DEBV after removing parental information as a response variable were higher than those obtained using DEBV without removing parental information in growth and carcass traits in US Red Angus beef cattle. The differences in genomic accuracies among the different panels were not significant for the traits used in this study. Although the GeneSeek80K panel contained a major marker, MC4R, this had no influence on genomic prediction; however, the genomic accuracy using this panel was approximately 3% higher than when using the Illumina60K panel in LMA.
An advantage of excluding the parent average (PA) was to avoid double counting. Otherwise using PA would shrink the individual EBV toward to the parent average [15]. However, the inclusion of PA after deregression added an advantage by accounting for the differences in PA among genotyped animals, such as between family differences [35]. For all studied traits, DEBVincPA, as the response variable, showed higher genomic accuracy than DEBVexcPA. This finding supports the result of Lee et al. [36] that DEBVincPA compared with other response variables (EBV and DEBVexcPA) was the most advantageous genomic prediction in Korean Yorkshire pigs. However, because this result is from a relatively small training size, we need further studies to verify the biased accuracy from the double counting issue by securing a larger training size.

Conclusions
In this study, we identified candidate genes for growth-and production-related traits in purebred Korean Duroc pigs, and evaluated and compared the accuracy of genomic prediction between two genotyping platforms, response variables, and two Bayesian methods (BayesB and BayesC). A total of 15 and 12 informative 1 Mb window regions for growth-and production-related traits were identified using the Illumina60K and GeneSeek80K panels, respectively. The genomic accuracy when using DEBVincPA as the response variable was of higher value than other response variables. We suggest that a fine-mapping study is necessary to pinpoint the causal variant of the informative genomic region (i.e., the MC4R gene), and that the genomic accuracy for growth-and production-related traits will be improved by adding a pinpoint for the causal variant of the informative genomic region. Furthermore, a genomic selection model for growth-and production-related traits could be useful for future genomic evaluation in purebred Korean Duroc pigs.