1. Introduction
Large-scale genotyping using high-density arrays or next-generation sequencing techniques has revolutionized genetic prediction and the identification of causal chromosomal regions through genome-wide association studies (GWAS). Furthermore, apportioning genetic variation to chromosomal regions (a goal of GWAS, and of regional heritability estimation using best linear unbiased prediction with a regional genomic relationship matrix and restricted maximum likelihood estimation of variance components) can improve genomic prediction accuracy [1]. Genetic prediction accuracy is improved by increased variance in regional genomic relationships and by higher, more consistent linkage disequilibrium between observed SNP and unknown or unobserved quantitative trait loci compared to the whole genome. Hence, GWAS and genomic prediction are complementary activities. However, effective implementation of genomic prediction or high-powered GWAS for complex and/or low-heritability traits may require thousands of animals with phenotypes and genotypes. Individually genotyping seedstock populations is cost effective when the cost is spread among large suites of routinely recorded traits. However, individual genotyping can be prohibitively expensive when collecting novel traits on commercial animals without recorded pedigree, for which only one trait is recorded.
DNA pooling can be an effective tool to reduce genotyping cost, and it captures greater than 80% of the power of individual genotyping in GWAS [2]. If costs of collecting phenotypes are much lower than genotyping costs, DNA pooling can reduce experimental costs by 90%. Our group has established the utility of GWAS using DNA pooling for novel complex traits such as fertility and disease resistance [3,4,5].
In addition to GWAS, early efforts have been made to apply pooling results to genomic prediction. Marker predictions from high and low phenotype groups could be used to accurately rank candidates for selection [6]. An alternative strategy could be the estimation of genomic relationships between pools and candidates for selection; for instance, genomic relationships could be used to derive estimated breeding values for sires of multiple-sire progeny in pools of extreme phenotypes [7]. Both techniques involve genomic prediction of animals that are known to have close relationships with animals in the pool. Commercial data capture is complicated by the fact that whole genome relationships between commercial cattle (source of data) and seedstock animals (selection candidates) are more distant and less variable compared to relationships between animals with data and animals being selected within the seedstock sector. The effectiveness of using SNP chip data from pools for genomic prediction depends on the signal-to-noise ratio, with the signal being allele frequency differences or haplotype frequency differences between pools of animals with extreme phenotypes, and the noise being pooling errors. Greater genetic differences occur when the trait has higher heritability or when phenotypes are more extreme. Pooling errors include pool construction error and technical error. Pool construction error includes weighing error, errors in measuring or recording DNA concentration, incomplete sample mixing, pipetting error, variation in DNA content of the tissue, variation in DNA extraction efficiency, and DNA fragmentation. In this study, technical error is defined as variation in estimates of sub-pool, animal, or haplotype contributions among replicate arrays for identical pools (same animals in same proportions). However, replicate arrays were not run in this study because technical error can be estimated at a lower cost as the variance in estimated animal contribution among sub-samples of SNP, sampled without replacement. This is the approach that we took. Technical error depends on the number of SNP and on variation in the allelic intensity ratio (Y/X) within genotype for each SNP; specifically, we estimate pooling allele frequency using the same formula that Illumina uses for calling genotypes and detecting copy number variation (or structural variants), Illumina θ = 2 × tan⁻¹(Y/X)/π [8]. The number of animals in the pool is inversely proportional to the average contribution from each animal; hence, we deliberately varied the contribution from each animal to approximate the influence of pool size on error without the expense of genotyping many individual animals. In this way, we can approximate error for pool sizes of 100 with only 16 animals. Our objectives were to evaluate pool construction error, technical error, and the influence of factors that affect these errors, such as number of SNP and planned representation of animals within the pool.
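The θ transformation and the link between animal contribution and pool size can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration; the use of `atan2` to avoid division by zero when X = 0 is a design choice of the sketch, not something stated by the authors.

```python
import math


def illumina_theta(x_intensity, y_intensity):
    """Return theta = 2 * arctan(Y / X) / pi, which lies in [0, 1].

    X is the intensity for the 'A' allele and Y the intensity for the
    'B' allele. atan2 equals arctan(Y / X) for non-negative intensities
    and also handles X == 0 without a division by zero.
    """
    return 2.0 * math.atan2(y_intensity, x_intensity) / math.pi


def equivalent_pool_size(contribution):
    """Pool size at which an equally represented animal would make up
    the given fraction of the pool (contribution = 1 / pool size)."""
    return 1.0 / contribution


if __name__ == "__main__":
    print(illumina_theta(1.0, 0.0))    # all 'A' signal
    print(illumina_theta(1.0, 1.0))    # equal 'A' and 'B' signal
    print(equivalent_pool_size(0.01))  # a 1% contribution
```

Under this reading, an animal planned at 1% of a pool behaves like one animal in an equally represented pool of 100.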
2. Materials and Methods
Animal samples utilized in this study were recovered from abattoirs. Therefore, no Institutional Animal Care and Use Committee approval was obtained.
Sample pooling strategy: Pools comprising many animals are more economical because the cost of genotyping a pool is spread over more animals. Pools with many animals have small contributions from each animal. To approximate variable numbers of animals per pool at a reasonable cost, we selected an experimental design with variable contributions from individual animals. Liver tissue samples were obtained at a commercial abattoir from a set of 50 Holstein steers with severely abscessed livers and 50 Holstein steers with no liver abscesses, with each pair of samples taken in close proximity on the processing chain. From these pairs of matched samples, 16 samples from abscessed and 16 samples from non-abscessed animals were randomly selected. There were no common animals between the two phenotypes; however, some steers with abscessed livers were likely related to steers with non-abscessed livers. Disease status (abscessed and non-abscessed livers) was used as the basis for dividing samples in this experiment into biological replicates.
Four sub-pools were created for each liver phenotype (abscessed/non-abscessed). Each sub-pool comprised four different liver tissue samples in parts of 1:2:3:4 (each part was 0.1 g), placed in a 10 mL tube designated for that sub-pool with 2 mL phosphate buffered saline for homogenization. Each liver tissue sample contributed to only one sub-pool (4 sub-pools × 4 samples/sub-pool = 16 samples total per phenotype). Isolation of DNA from each sub-pool was performed by a standard phenol/chloroform extraction protocol. Two super pools were then derived from the extracted DNA of the sub-pools in two different arrangements, with the four sub-pools again representing parts of 1:2:3:4. As shown in Figure 1, four different sub-pools and two different super pools were formed, as previously described, for each liver phenotype. Thus, the planned representation of the 16 individual animals is given in Table A1 and Table A2 based on the design presented in Figure 1. Individual DNA from the 32 animals used in the liver tissue pools was extracted using the QIAamp DNA Mini Kit following the manufacturer’s instructions (Qiagen, Santa Clarita, CA, USA). All individual DNA samples and pools were then mixed, by placing individual samples on a rotator, and quantified by the DeNovix DS-11 FX+ Spectrophotometer/Fluorometer (DeNovix Inc., Wilmington, DE, USA) using 2 µL of sample and the dsDNA photometric setting. Quality of the DNA was evaluated for all individual samples and pools using gel electrophoresis to ensure that high molecular weight DNA was present and intact.
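The planned contributions implied by the 1:2:3:4 design can be worked out with a short Python sketch. It assumes that an animal's planned share of a super pool is the product of its share within its sub-pool and that sub-pool's share of the super pool; the authors' actual planned values are in Table A1 and Table A2, which are not reproduced here, and the function names are hypothetical.

```python
from fractions import Fraction

PARTS = (1, 2, 3, 4)  # planned parts in the 1:2:3:4 design


def planned_fractions(parts=PARTS):
    """Convert parts such as 1:2:3:4 into exact fractions of the whole."""
    total = sum(parts)
    return [Fraction(p, total) for p in parts]


def super_pool_contributions(parts=PARTS):
    """Planned animal shares of a super pool.

    Assumption: share = (sub-pool share of super pool)
                        x (animal share within its sub-pool).
    """
    shares = planned_fractions(parts)
    return [[sub * animal for animal in shares] for sub in shares]


sub_pool = planned_fractions()
super_pool = super_pool_contributions()
unique_super = sorted({c for row in super_pool for c in row})

if __name__ == "__main__":
    print([float(f) for f in sub_pool])      # animal shares of a sub-pool
    print([float(f) for f in unique_super])  # distinct shares of a super pool
```

Using exact fractions avoids floating-point noise when counting the distinct planned contribution levels.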
Our experimental design resulted in pools with a broad range of sample contributions ranging from 1 to 40% (Figure 1; Table A1 and Table A2). Furthermore, there was planned proportionality in sample representation that was constant between sub-pools and super pools, allowing us to evaluate the stability of proportionality estimates at different dilutions. Discrepancy from proportionality would indicate instability of pool composition, changes in real animal proportions over time, or technical error. Realized or observed animal contributions were estimated from Illumina θ [8]. Illumina θ for super pools and sub-pools was computed as θ = 2 × tan⁻¹(Y/X)/π, where X is the red intensity identifying the ‘A’ allele and Y is the green intensity representing the ‘B’ allele, using the Illumina AB nomenclature [8]. We term pooling error sources as ‘pool construction’ (estimated DNA quantity does not match planned quantity), ‘technical error’ (caused by variation in Illumina θ within genotype or pools with the same animal representation or replicated arrays of the same pool construction), and error in estimating animal contributions. All 32 animals, eight sub-pools, and four super pools were genotyped using the Illumina BovineHD array (Illumina, Inc., San Diego, CA, USA) by Neogen Corporation (Lincoln, NE, USA).
Statistical analysis: For each of the eight sub-pools and four super pools, we estimated the contribution of each constituent animal (four animals per sub-pool and 16 animals per super pool) using quadratic programming to minimize the residual sum of squares, subject to the constraints that the estimated animal contributions were positive and summed to 1. We used the solve.QP() function in the contributed R package quadprog [9,10], with Illumina θ for the super pool or sub-pool as the dependent variable and genotype (number of copies of the B allele)/2 as the independent variable.
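The constrained least-squares problem described above can be sketched in Python. This is not the authors' solve.QP() call: as a stand-in that needs only the standard library, it uses projected gradient descent onto the probability simplex, which solves the same problem when the genotype matrix has full column rank. All names and the simulated genotypes are assumptions made for illustration.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            tau = t  # keep the threshold from the last index that qualifies
    return [max(x - tau, 0.0) for x in v]


def estimate_contributions(genotype_half, theta, n_iter=1500):
    """Minimize sum((theta - G w)^2) subject to w >= 0 and sum(w) = 1.

    genotype_half: one row per SNP, one column per animal, each entry
    (copies of B allele) / 2. theta: pool theta per SNP.
    """
    n_animals = len(genotype_half[0])
    w = [1.0 / n_animals] * n_animals
    # trace(G'G) bounds the largest eigenvalue, so this step size is safe.
    trace = sum(g * g for row in genotype_half for g in row)
    step = 1.0 / (2.0 * trace)
    for _ in range(n_iter):
        grad = [0.0] * n_animals
        for row, t in zip(genotype_half, theta):
            resid = sum(g * wi for g, wi in zip(row, w)) - t
            for k, g in enumerate(row):
                grad[k] += 2.0 * resid * g
        w = project_to_simplex([wi - step * gk for wi, gk in zip(w, grad)])
    return w


def simulate(n_snp=200, weights=(0.1, 0.2, 0.3, 0.4), seed=1):
    """Simulated, noise-free pool: theta is the weighted genotype mean."""
    rng = random.Random(seed)
    genotype_half, theta = [], []
    for _ in range(n_snp):
        p = rng.uniform(0.1, 0.9)  # hypothetical B-allele frequency
        row = [((rng.random() < p) + (rng.random() < p)) / 2.0 for _ in weights]
        genotype_half.append(row)
        theta.append(sum(g * w for g, w in zip(row, weights)))
    return genotype_half, theta


G, theta = simulate()
estimated = estimate_contributions(G, theta)

if __name__ == "__main__":
    print([round(w, 4) for w in estimated])
```

In the simulated, noise-free case the planned 1:2:3:4 proportions are the unique minimizer; with real intensity data, θ is noisy and the estimates deviate from the planned values, which is the error the study quantifies.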
Theoretically, pool construction error should be greater for larger planned animal contributions compared to smaller planned animal contributions based on the Dirichlet distribution, which is commonly assumed for the probabilities underlying the multinomial distribution. We tested for equality of variance in pool construction error among groups of planned contributions using a Levene-type test, which is robust to deviations from normality [11], using the levene.test function in R [10]. There were four unique values among animal contributions to sub-pools and nine unique values among animal contributions to super pools. The Levene test requires greater replication within planned contribution level than our experiment allowed to achieve adequate power to detect differences in pool construction variance. In our analysis, we clustered similarly planned comparisons to overcome the small number of replicates within planned comparison level (Table A3) and looked at sensitivity to the level of granularity or aggregation to evaluate the robustness of our results. We used default parameters with the exception of correction.method = “zero.correction”, kruskal.test = TRUE, bootstrap = TRUE, and num.bootstrap = 100,000, which implemented a bootstrap rank-based (Kruskal-Wallis) modified robust Brown-Forsythe Levene-type test based on the absolute deviations from the median with modified structural zero removal method and correction factor.
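The core of a Brown-Forsythe Levene-type test can be sketched in Python as a one-way ANOVA on absolute deviations from group medians. This simplified illustration returns only the F statistic; it does not reproduce the bootstrap, the rank-based (Kruskal-Wallis) step, or the zero correction used in the analysis, and the function name is hypothetical.

```python
from statistics import mean, median


def brown_forsythe_f(groups):
    """F statistic from a one-way ANOVA on |x - group median|.

    groups: list of lists, one list of observations per group.
    A large F suggests unequal spread among the groups.
    """
    deviations = []
    for g in groups:
        centre = median(g)
        deviations.append([abs(x - centre) for x in g])

    k = len(deviations)
    n_total = sum(len(d) for d in deviations)
    grand_mean = mean([x for d in deviations for x in d])

    # Between-group and within-group mean squares of the deviations.
    between = sum(len(d) * (mean(d) - grand_mean) ** 2 for d in deviations) / (k - 1)
    within = sum((x - mean(d)) ** 2 for d in deviations for x in d) / (n_total - k)
    return between / within


if __name__ == "__main__":
    same_spread = [[1, 2, 3], [11, 12, 13]]
    different_spread = [[4, 5, 6], [0, 5, 10]]
    print(brown_forsythe_f(same_spread))
    print(brown_forsythe_f(different_spread))
```

Centring on the median rather than the mean is what makes this variant robust to non-normal errors.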
We evaluated technical error as influenced by the number of SNP by subsampling all 777,962 SNP without replacement, computing animal contributions for each subsample, estimating the standard deviation for each animal across subsamples, and averaging the result across sub-pools and super pools within the number of subsamples. The number of subsamples and SNP per subsample are in Table 1.
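The subsampling idea can be sketched in Python under stated assumptions: the SNP are dealt into disjoint, equal-sized subsamples, and a simple mean of simulated per-SNP values stands in for the quadratic-programming estimate of animal contribution. The function names and simulated data are hypothetical.

```python
import random
from statistics import mean, stdev


def disjoint_subsamples(n_items, n_subsamples, seed=0):
    """Shuffle indices once and deal them into equal-sized, non-overlapping
    subsamples (sampling without replacement)."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    size = n_items // n_subsamples
    return [indices[i * size:(i + 1) * size] for i in range(n_subsamples)]


def sd_across_subsamples(values, n_subsamples, estimator=mean, seed=0):
    """Standard deviation of an estimate across SNP subsamples.

    estimator is a stand-in for the animal contribution estimate.
    """
    subsamples = disjoint_subsamples(len(values), n_subsamples, seed)
    estimates = [estimator([values[i] for i in s]) for s in subsamples]
    return stdev(estimates)


# Simulated per-SNP values (hypothetical): true value 0.25 plus noise.
rng = random.Random(42)
per_snp_values = [rng.gauss(0.25, 0.1) for _ in range(20000)]

sd_few_snp = sd_across_subsamples(per_snp_values, n_subsamples=1000)  # 20 SNP each
sd_many_snp = sd_across_subsamples(per_snp_values, n_subsamples=10)   # 2000 SNP each

if __name__ == "__main__":
    print(sd_few_snp, sd_many_snp)
```

The spread of the estimate across subsamples serves as a proxy for technical error without running replicate arrays, and it shrinks as each subsample contains more SNP.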
To evaluate the consistency of proportionality between sub-pools and super pools, we regressed animal contribution to the super pool on animal contribution to the sub-pool. If the r² from this analysis is high, then there is a strong proportionality between animal contributions to sub-pools and super pools, and the estimated contribution of each animal is not affected much by being diluted in a pool of additional animals.
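The proportionality check can be illustrated with a minimal Python sketch of simple linear regression. The contribution values below are the planned ones for a sub-pool that makes up 40% of a super pool, not estimates from the study, and the function name is hypothetical.

```python
def simple_regression(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Planned animal shares of a sub-pool, and of a super pool in which
# that sub-pool is 40% of the total (illustrative values only).
sub_pool_share = [0.1, 0.2, 0.3, 0.4]
super_pool_share = [0.4 * s for s in sub_pool_share]

slope, intercept, r_squared = simple_regression(sub_pool_share, super_pool_share)

if __name__ == "__main__":
    print(slope, intercept, r_squared)
```

With perfect proportionality the slope equals the dilution factor and r² equals 1; estimation error in either pool lowers r².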
We estimated haplotypes without pedigree using Beagle version 5.2 [12] and hap-ibd version 1.0 [13] to identify shared identity by descent segments among the 32 liver samples plus HapMap animals [14].
Breed composition of the 32 liver samples was estimated using a multiple regression method [15], with the exception that we constrained the breed contributions to sum to 1 and be ≥ 0 using quadratic programming [10], and the breed SNP frequency reference data were derived from BovineHD 770 k data for multiple diverse breeds [14].
4. Discussion
Pool construction error increased with planned animal contribution (p ≤ 0.0154), which implies that pool construction error decreases with an increasing number of animals equally represented in a pool, because planned contribution is the reciprocal of the number of animals equally represented. Based on these results, we recommend more animals per pool, as also supported by previous literature [18].
Technical error in animal contributions decreases as more SNP are used to estimate animal contribution, which is consistent with the Central Limit Theorem: the sampling variance of the estimate shrinks as more SNP are included in the computation. An implication of this result is that we can accurately estimate haplotype contributions within a chromosomal region if there are adequate numbers of SNP within the region.
The four animals in each sub-pool are each represented in two super pools at two different dilutions. The proportions of the animals in the sub-pool were strongly correlated with their proportions in the two super pools regardless of the dilution; furthermore, the two dilutions in the super pools were strongly correlated. This finding suggests that pools of animals with extreme phenotypes from different breeds can be combined into larger pools to save money. Similarly, pools of animals with extreme phenotypes from different seasons, pens within a feedlot, feedlots, and pastures can be combined. Commercial feedlot cattle collected from a packing plant generally comprise multiple breeds and crossbreeds, and the phenotypically extreme animals in a particular pen are likely to include more than one breed; indeed, in most circumstances we do not know the breed makeup when we are processing the samples into pools. The phenotypic extremes within a particular pen may not comprise very many animals. For example, the top 5% of 200 animals in a pen is only 10 animals. Combining animals with extreme phenotypes from 10 pens to make one pool of 100 animals results in a savings of 90% relative to genotyping 10 pools with 10 animals each.
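The arithmetic in the example above can be written as a small Python sketch. It assumes that genotyping cost is one array per pool at equal cost per array and ignores phenotyping and pool construction costs; the function names are hypothetical.

```python
def extreme_animals(pen_size, fraction):
    """Number of animals in the phenotypic extreme of a pen."""
    return round(pen_size * fraction)


def savings_from_combining(n_small_pools, n_combined_pools=1):
    """Fraction of array cost saved by combining small pools.

    Assumption: one array per pool, same cost per array.
    """
    return 1.0 - n_combined_pools / n_small_pools


if __name__ == "__main__":
    per_pen = extreme_animals(200, 0.05)  # top 5% of a 200-head pen
    print(per_pen, 10 * per_pen)          # animals per pen, animals in combined pool
    print(savings_from_combining(10))     # 10 pen-level pools merged into 1
```

The saving depends only on the ratio of arrays run, so it holds for any per-array price under the stated assumption.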
Although not typically thought of this way, the number of copies of a B allele for an individual divided by two is the allele frequency of two haplotypes in the individual, one of maternal and the other of paternal origin. When we regress Illumina θ for a pool on genotypes for individuals, we are estimating the representation of pools of two haplotypes in a larger pool context. If phenotypically extreme animals in the pool share chromosomal regions IBD with other animals not in the pool, then the distribution of haplotypes of animals not in the pool can be accurately estimated, and those haplotype contributions can be used to inform the estimated breeding value of other animals through the IBD sharing. Using 718 animals from 18 diverse breeds, we found that all 32 animals in our pools each shared between 85 and 95% of their genomes in IBD with at least one reference haplotype from 746 or 734 animals not in the pool representing multiple diverse breeds. This demonstrates that reference haplotypes from approximately 750 diverse animals (1500 haplotypes) are sufficient to cover IBD for 85 to 95% of the genome for a purebred Holstein or crossbred beef animal; it all hinges on whether the haplotypes in the pool are covered by reference haplotypes, and they were in this case. It is unknown whether the high coverage in this case was due to a small sample of Holstein haplotypes or due to fairly large haplotype segments being ubiquitous across populations as a result of historical natural and artificial selection or random drift [19,20].