Bayes Factor-Based Regulatory Gene Network Analysis of Genome-Wide Association Study of Economic Traits in a Purebred Swine Population

Early stage prediction of economic trait performance is important and directly linked to profitability of farm pig production. Genome-wide association study (GWAS) has been applied to find causative genomic regions of traits. This study established a regulatory gene network using GWAS for critical economic pig characteristics, centered on easily measurable body fat thickness in live animals. We genotyped 2,681 pigs using Illumina Porcine SNP60, followed by GWAS to calculate Bayes factors for 47,697 single nucleotide polymorphisms (SNPs) of seven traits. Using this information, SNPs were annotated with specific genes near genome locations to establish the association weight matrix. The entire network consisted of 226 nodes and 6,921 significant edges. For in silico validation of their interactions, we conducted regulatory sequence analysis of predicted target genes of transcription factors (TFs). Three key regulatory TFs were identified to guarantee maximum coverage: AT-rich interaction domain 3B (ARID3B), glial cell missing homolog 1 (GCM1), and GLI family zinc finger 2 (GLI2). We identified numerous genes targeted by ARID3B, associated with cellular processes. GCM1 and GLI2 were involved in developmental processes, and their shared target genes regulated multicellular organismal process. This system biology-based function analysis might contribute to enhancing understanding of economic pig traits.


Introduction
Growth rate traits, such as average daily gain (ADG) and days to 90-kg body weight (DAYS), and production traits, such as backfat thickness (BFAT) and lean percent (PCL), have been typically considered as important traits, as they play a major role in the economic success of Korean pig breeding programs. Moreover, the lifetime total number born (LTTNB), lifetime number born alive (LTNBA), and weaning to estrus interval (WEI) are also economically important for sow longevity and reproduction. To date, these economic traits have been genetically improved successfully based on traditional best linear unbiased prediction (BLUP), and breeding values of economic traits have been used with a selection index to select elite lines in Korean pig breeding.
Recently, genomic information in the form of dense single nucleotide polymorphism (SNP) marker panels (e.g., Illumina, Neogen-GeneSeek, and Affymetrix) has become available for genetic evaluation, owing to improvements in genotyping technology and statistical methods. One of its applications is in genome-wide association study (GWAS), which has become a powerful genomics tool to identify genetic loci or genes underlying quantitative traits in domestic animals [1]. The single marker regression (SMR) method was first introduced in GWAS. However, the original and other modified SMR versions might have limited application in the estimation of SNP marker effect in the entire genome owing to various reasons: the SMR methods tend to overestimate the SNP marker effect as they ignore the effects of other SNP markers [2] and they are insufficient to detect SNPs with small effects. Therefore, it is useful to apply the Bayesian approach, which fits all possible multiple markers simultaneously, and was originally developed for genomic selection [3]. It has been shown to be a better approach for quantitative trait loci (QTL) mapping or GWAS than the SMR method in detection power [2,4,5]. Furthermore, Fortes, et al. [6] recently suggested a system biology-based strategy called association weight matrix (AWM) that integrates information from GWAS to study complex traits and identify candidate genes. Several researchers have applied this methodology using p-values of the GWAS result, but there has been no report of the use of a combination of AWM-methodology and Bayesian approach thus far [7][8][9][10].
The objectives of the present study were to: (i) conduct a GWAS using the Bayesian method to investigate the genetic architecture and chromosomal regions associated with economic traits of pigs, including growth rate and production-related traits such as litter size information in Yorkshire population using dense SNP panels and (ii) identify the co-associated regulatory network of the multi-trait Bayesian approach GWAS using the AWM methodology.

Genotypes and Phenotypes
From 2014 to 2017, 2681 Yorkshire pigs were genotyped using Illumina PorcineSNP60 version 2 (Illumina, Inc., San Diego, CA) comprising 61,565 SNP markers. After excluding SNPs that were unmapped, on sex chromosomes, and those with poor call rates (<0.95), the available number of SNP markers was 47,697. Duplicated animals (n = 60) caused by re-genotyping to obtain acceptable call rates, and animals with lower call rates (n = 30) were removed after comparing their call rates. We also removed animals (n = 19) with call rates <0.90. The parentage test was performed using SEEKPARENTF90 software (INIA, Las Brujas, UY) [11] with known parent-offspring in the pedigree file. A conflict threshold of 10% was used to detect paternity error and correct the pedigree file. Consequently, 244 genotyped animals were removed, and the pedigree file was corrected. Furthermore, genotype identification data that could not be matched to the corresponding animals in the phenotypic and pedigree files were removed, leaving 1833 animals for further GWAS. Missing SNP genotypes (0.27%) were imputed using FImpute version 2.2 [12].
All experimental procedures involving animals were conducted in accordance with the Guide for Care and Use of Animals in Research and approved by the Institutional Animal Care and Use Committee of the National Institute of Animal Science (No. 2015-137).

Measurement of Economic Traits
Body weights were measured once during performance testing (at approximately 150 days). The ADG was calculated as the difference in final weight and initial weight divided by the number of days at the time of performance testing. The DAYS was estimated according to the recommendations of the Korean Swine Performance Recording Standards (KSPRS), adjusted from birth to the time of performance testing. The BFAT was calculated based on the average fat thickness values of the shoulder (on the fourth thoracic vertebrae), mid-back (on the last thoracic vertebrae), and loin (on the last lumbar vertebrae) measured using the A-mode (amplitude mode) ultrasound device (PIGLOG 105). The PCL was calculated according to the recommendations of the KSPRS, following previously reported procedures [13]. More details of the correction formula for growth and production traits were reported by Choy et al. [14]. The three-sow reproduction-related traits LTTNB, LTNBA, and WEI were obtained from real phenotypic records.

Response Variable
Phenotypic data of 39,518 purebred Yorkshire pigs were collected from three Korea GGP farms between 2012 and 2017. Pedigree data from 99,694 individuals were also used. Table 1 shows the number of available records, phenotypic means and their standard deviation, variance component, and heritability for each trait. Genetic parameters, breeding values, and the corresponding reliability were estimated using a pedigree relationship matrix fitted with ASReml version 4.1 software (VSN International Ltd., Hemel Hampstead, UK) [15] for growth rate (ADG and DAYS), production traits (BFAT and PCL), and reproductive traits (LTTNB, LTNBA, and WEI). A multi-trait animal model was used for those parameters and estimated breeding value (EBV) including fixed effects of farm, birth-year, season, and sex. Further, the deregressed estimated breeding value (DEBV) was adjusted for parental information by a combination of deregression after adjusting for parental average such that the DEBV information contained only their phenotypic information and that of their descendants. The response variable was weighted to account for the heterogeneous variance of DEBV due to the differences in EBV reliabilities among the genotyped animals. The weighting factor [16] for animal i (w i ) was calculated as follows: where, r 2 i is the reliability of DEBV, h 2 is the heritability of the trait, and c is the proportion of genetic variation that could not be explained by markers. In the present study, c was assumed to be equal to 0.40 [17]. After removing animals with reliability <0.10, 1596 registered Yorkshire pigs were used in the GWAS.

Bayesian Method for Genome Wide Association Study
The BayesB [3] method with π set to 0.99 and weighting factors was used to estimate the effect of SNP markers and calculate variances attributed to every non overlapping 1-Mb genome window using GenSel4R software [18]. BayesB method uses a mixture model that assumes some fraction π of SNP markers have zero effect and every SNP marker has locus-specific variances. For each trait, the following model was fitted to estimate marker effects: where, y i is response variable (DEBV) of animal i for the respective trait; µ is the population mean; k is the number of markers; Z ij is allelic state at locus j in individual i; u j is the random substitution effect for marker j, which follows a mixture distribution for this random substitution effect according to indicator variable (δ j ), a random 0/1 variable indicating the absence or presence of marker j in the model, with u j assumed normally distributed N(0,σ 2 u ) when δ j = 1; and e i is a random residual effect assumed to be normally distributed N(0, σ 2 e ). The posterior distribution of the parameters and effects was obtained using Gibbs sampling for 110,000 Markov chain Monte Carlo (MCMC) iterations, of which the first 10,000 were discarded for burn-in before estimating posterior means of marker effects and variances, saving the results every five cycles. The accumulated frequency across iterations of the chain for a particular SNP based on prior π fitted in the model (referred to as "model frequency") can be used as evidence of an informative SNP or QTL [19]. However, the adjacent SNPs might be in high linkage disequilibrium (LD) with the same QTL in a high-density SNP panel and, hence, the effect of QTL and the SNP model frequency would be spread over all SNPs in high LD, which can result in the underestimation of individual SNP effect and model frequency [20]. Therefore, a window approach, which accumulates the effects of adjacent SNPs for each 1-Mb region, has been implemented in GenSel4R software, and this 1-Mb window approach was used to identify informative genomic regions accounting for LD. Initial values for genetic and residual variances for BayesB were estimated using a linear mixed model implemented in ASREML ( Table 1). All procedures were performed using GenSel4R software [18]. In total, 2452 consecutive non-overlapping 1-Mb windows across the whole genome were included in the GWAS.

Identification of Significant Window Regions and Single Nucleotide Polymorphism Markers
An additive genetic variance of 1.0%, which was estimated as a fraction of the total genetic variance explained by all SNPs, was used as the significance level of putative informative 1-Mb window region. Unlike the single marker regression approach, there is no P-value for significance of SNP marker in Bayesian approaches. Therefore, the posterior probability of inclusion of each SNP marker into the model (model frequency) in MCMC cycles is mostly used as a criterion for detecting QTLs [21]. Bayes factor (BF) derived from model frequency was used to determine the SNP with a significant association within this region.
where, π is the prior probability andp i is the posterior probability of the fraction of times the SNP was distributed. Following the definitions of Kass et al. [22] for the strength of an association based on their range of values, the SNP markers with BF > 3.2, > 20, and > 100 were considered "suggestive," "strong,", and "decisive" evidence, respectively.

Association Weight Matrix Construction
The AWM consists of rows representing genes and columns representing the additive effect of each trait based on the results of the GWAS [6]. Before construction, we selected a "weakly" significant criterion of Bayes factor of ≥ 3.2 [22], and BFAT was used as a key phenotype among the seven traits. Firstly, SNPs that were significantly associated with BFAT or associated with at least two phenotypes were selected. Secondly, the SNPs satisfying the distance information of SNPs to the nearest annotated coding region of the gene were additionally filtered, i.e., those that were either <2500 bp or >1.5 Mb away from the nearest gene were eliminated. Finally, only one SNP was selected to represent the gene (the first criterion was the number of statistically significant traits to the SNP and the second was more significant to the key phenotype). The partial correlation and information theory (PCIT) algorithm was used to identify a significant interaction among the genes and SLP-related traits using the PCIT library in R [23]. The hierarchical clustering option in PermutMatrix software [24] was used to visualize the AWM. To visualize the network of the AWM genes, every significant co-associated gene was applied in Cytoscape, and the network density of each gene was obtained using the MCODE sub-package [25].

Network Analysis Using Transcription Factor and Target Gene Information
To provide in silico validation of the gene-gene interactions and validate the whole network, among various available methods, we used bio-informatics analysis that predicts TFs and their target genes [6]. To determine whether a gene is a TF or not, it was compared with the pig and human transcription factor database list sets [26]. The genes identified as TFs required motif information. Sus scrofa motif information from the CisBP database [27] was mainly used, and vertebrate data from JASPAR [28] were used to supplement missing information. We extracted the flank region sequence (upper 2000 bp) of every gene in the whole network from the Ensembl BioMart database [29]. To identify locally overrepresented TF binding sites (TFBS), the FIMO tool [30] was used. It detected all the TFBS and extracted the significant clusters (P < 0.001) by calculating their score functions [31]. The top three TFs were chosen to satisfy the maximum coverage as previously reported [7]. The classification analysis of the function of node gene was analyzed by inputting the list of gene ensemble ID into the Panther classification system [32].

Genome-Wide Association Study Using Single Nucleotide Polymorphisms Markers with Illumina PorcineSNP60
We performed a GWAS using SNP markers on Illumina PorcineSNP60 based on several parameters estimated by the BayesB method (i.e., the absolute SNP marker effect, model frequency, and the genetic variances explained by SNP markers). Bayesian GWAS applies the threshold for the significance of SNP markers based on the derivative of model frequency (i.e., BF) [21]. However, a single QTL could spread the effects over multiple SNPs when using high-density SNP panel as a high linkage disequilibrium (LD) within adjacent SNP markers. These results may lead to an increase in the probability of false positives and false negatives [19]. To overcome these problems, we used two thresholds: (i) additive genetic variance by accumulating within 1-Mb chromosomal regions and (ii) BF based on the model frequency. The results of the GWAS of growth, productive, and reproductive traits including chromosomal and window location (Mb), the percentage variance of 1-Mb genome windows, SNP, physical genome position (Mb), additive effect of the significant SNP marker within these regions, and BFs in Yorkshire pigs are presented in Tables 2 and 3. In this study, the threshold of percentage variance of 1-Mb genomic region and BF used to identify associations with traits were > 1.0% and 20, respectively. The Manhattan plots for the analyzed traits are shown in Figure 1. Table 2. Informative 1-Mb genome windows and single nucleotide polymorphisms (SNPs) within windows associated with growth rate (ADG and DAYS) and production (BFAT and PCL) traits in Yorkshire pigs from the GWAS using markers on Illumina PorcineSNP60.

Growth-Related Traits
In the present study, the most informative 1-Mb window region was detected on SSC17 at 17 Mb, which explained 1.88% and 2.22% of additive genetic variances for ADG and DAYS traits, respectively. Furthermore, the most significant QTL was found at 17.55 Mb on SSC17 (rs342665431) with the highest BF: 298.75 and 764.87 for ADG and DAYS traits, respectively. A previous study [20] reported that the most significant SNP (rs342665431) was from the BMP2 gene on SSC17, which is consistent with our results. The BMP2 gene is a member of the bone morphogenetic protein family that regulates early myogenesis. We also found identical informative 1-Mb window regions between the growth-related traits and SNP located at the 93-Mb position of SSC5 (rs345168974) with 1.40% and 1.12% additive genetic variances for ADG and DAYS traits, respectively.

Production-Related Traits
We found 15 significant QTLs within 12 informative chromosomal regions (significance level > 1.0% additive genetic variance or BF > 20) on SSCs 2, 4, 5, 6, 7, 8, 14, 15, and 16 for BFAT and PCL. The most significant 1-Mb window region explaining 3.51% and 5.87% of additive genetic variances was captured on SSC2 at 162 Mb, including two SNPs (rs81341288 and rs81328276) in BFAT and PCL. The QTL window located on SSC at the beginning, which explained 2.68% and 4.48% of additive genetic variances, included SNPs (rs81317307 and rs81318741) for BFAT and PCL. Furthermore, rs81317307 was the most significant SNP based on BF (1268.40) associated with PCL. Van Laere, et al. [33] reported that the IGF2 gene on SSC2 has an important role in the development of skeletal muscle and BFAT as well as postnatal muscle regeneration and hypertrophy. Other QTL windows were also detected on SSC2 at 76 Mb, which explained 2.87% and 1.49% of additive genetic variances for two production-related traits. The most significant SNP located on SSC5 at 65 Mb (rs81343150) was identified based on BF (304.09) for BFAT.

Reproduction-Related Traits
Some candidate chromosomal regions and QTLs associated with reproduction-related traits were identified. We found six significant QTLs within five informative chromosomal regions on SSCs 1, 12, and 16 for reproduction-related traits. Among those genes, the superoxide dismutase 2 (SOD2) gene has been reported to have one polymorphism associated with male infertility [34]. Considering the results of the GWAS, a few QTLs for LTTNB, LTNBA, and WEI but no QTL were identified for NPW, which might be due to low heritability (Table 1) and relatively smaller sample size for detecting significant QTL regions. Another reason for this result might be high criteria of significance. Because of the high cut-off criteria for the significance level of the traditional single trait GWAS strategy, it is difficult to determine the useful QTL on those reproduction-related traits [35].

Co-Association Network Based on Association Weight Matrix
The constructed AMW consisted of 215 (211 genes and 4 SNPs) × 7 (traits), and each cell represents their z-value normalized additive effect (Figure 2a). To visualize this, Permutmatrix software [24] was used. There were three main obvious findings in the visualized matrix. First, the genes had strong effects on both of body-related traits (PCL and BFAT) and growth-related traits (DAYS and ADG). Second, PCL and BFAT, and DAYS and ADG pairs, respectively, were almost the compensate tendency by the effects of the genes. This was an obviously understandable deduction from the meaning of the traits (e.g., meat percentage information versus fat information for PCL and BFAT). Another finding is that it was difficult to identify the specific tendency of the effect on reproductive traits (LTTNB, LTNBA, and WEI) because it was quite different from that on the traits of the two groups mentioned above and it had a weak effect (relatively dark compared to other groups). Therefore, research on those reproduction-related traits is limited by the traditional GWAS method [35] and AWM-based approach has recently emerged as a useful option [6,10].  The whole network created by the PCIT analysis based on AWM information consists of 226 nodes and 6921 edges (Figure 2b). The rectangle node indicates seed on the network and the diamond node is uncluttered based on the MCODE analysis of Cytopscape application . Based on the comparison  with the TF database, ARID3B, ATF6B, DMTF1, GCM1, GLI2, ISL1, KDM5B, KLF17, NFYC, NPAS3, and WDHD1 were identified as TFs. Among them, five TFs (ARID3B, GCM1, GLI2, ISL1, and NFYC) had motif information. Each TF and its target gene network were matched using FIMO tool analysis and, finally, the top trio network consisting of 43 genes and centered on ARID3B, GCM1, and GLI2 was constructed (Figure 2c).
The AT-rich interaction domain 3B (ARID3B) encodes a member of the AT-rich interaction domain (ARID) family of DNA-binding proteins [36]. Studies have reported that the ARID3B gene affects the regulation of limb development [37]. However, the function of the ARID3B gene in pigs has not been studied comprehensively. The ARID3B has 13 in silico validated target genes (CTTNBP2, DRC1, ENSSSCG00000017864, FAM134C, ICA1L, KIAA1324L, LOC100038019, LOC100155829, LOC100518725, MMP1, PARS2, PDE4B, and POLR2G), most of which (six out of 13 genes, DRC1, ENSSSCG00000017864, LOC100518725, PARS2, PDE4B, and POLR2G) are involved in the cellular process (Figure 2c). Many of them were essential genes for survival. For example, DRC1 is essential for motile cilia function in algae and humans, and POLR2G encodes the RNA polymerase II subunit G [38,39]. Moreover, PARS2 encodes a putative member of the class II family of aminoacyl-tRNA synthetases and those with mutations in PARS2 could have Alpers syndrome [40].

Conclusions
This study not only provides a list of chromosomal regions and SNPs associated with economically important traits, but also their candidate associated genes. The information about the SNP markers and chromosomal regions associated with the studied traits could be considered as prior information in a genomic selection model. Additionally, to the best of our knowledge, this is the first study to propose a BF-based regulatory gene network, unlike AWM with p-value information reported previously. This co-association regulatory network created using BFAT as a key trait, would facilitate the validation of the genetic understanding of other economically important traits in pigs. These biologically non-similar traits network could be very useful for the development of improved breeding strategies in the future. Further studies are needed to clarify the specific molecular or cellular processes of interaction among the TF trios and their target gene networks predicted to determine economically important traits in pigs.

Conflicts of Interest:
The authors declare no conflict of interest.