A printed edition of this Special Issue is available at MDPI Books....

Open Access Article

Peer-Review Record

Estimation of Pool Construction and Technical Error

Agriculture 2021, 11(11), 1091; https://doi.org/10.3390/agriculture11111091

by John Keele¹,*, Tara McDaneld¹, Ty Lawrence², Jenny Jennings³ and Larry Kuehn¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 20 September 2021 / Revised: 27 October 2021 / Accepted: 1 November 2021 / Published: 4 November 2021

(This article belongs to the Special Issue Application of Genetics and Genomics in Livestock Production)

Round 1

Reviewer 1 Report

The authors estimated pooling errors. Sub-pools of abscessed or normal liver tissue were constructed by taking different quantities of tissue from different individuals, for various planned proportions. DNA extracted from the sub-pools was then used to form super pools of the abscessed or normal groups, based on the DNA concentration in the sub-pools, again with different planned contributions of the sub-pools. They estimated technical error by comparing estimated animal contributions using sub-samples of SNPs from the Illumina HD bovine SNP array. Pool construction error, measured as the deviation of observed animal contributions from the planned contributions, increased with the planned contribution of individual animals (that is, the larger the number of individuals in the pool, the smaller the construction error), while technical error decreased with the number of SNPs used (more SNPs decrease technical error).

Molecular and statistical methods are sound and the conclusions are valid, albeit unnecessarily complicated.

No tables! A lot of numeric data in the text, no detailed numeric data in the figures. It seems that the author(s) don't like tables…

Specific comments

Suggest sub titles of the various sections.
Supplemental data are not mentioned. They should be, in Lines 75-82, for example.
Add "to" after compared.
Add "to the planned contribution," before "using".
L22-23. Not clear. Reword.
"GWAS" already includes "study"…
L64-65. Note that pooling errors can also include weighing error, different DNA content in the tissue, and even recording errors (e.g., mixing samples, recording the wrong DNA concentration).
L67-68. Not clear. Does "technical error" refer to the estimate of the population by the pools, or to the contribution to the pools?
Add "in this study".
L89 and further, and Figures 1, 2. The text discusses "super pools", while "pools" are presented in these figures.
L93-94. Define "n". 0.1 g tissue?
L94-95. What was the source of the individuals' DNA?
Figure 1 presents both sub- and super pools. Add "to the Sub-Pools," before "ranging", and ", and varying from 0.01 to 0.16 in the super pools" after ")".
L110-112 and later on. Not clear. What is "variation in B allele frequency within genotype"? By definition, Allele B can only be 0, 0.5 and 1, within genotypes AA, AB and BB. Can't be any variation, by definition. There could be variation in the population though. Is that what you mean?
In Figure 2 too you present sub- and super pools together. Add, "Pooling sub and super pools together," before "There".
L129-130, 137. The subject is the number of unique values among all planned contributions. The number of contributions of the animal is presented in Line 137. Move "4 super pools * 16 animals per super pool + 8 sub-pools * 4 animals per sub-pool" to L137 instead of "13 distinct values", and change here to "4 unique values among the Sub-Pools + 9 unique values among the Super-Pools".
Show the calculation of the contributions and add a table to show the values. Something like:

Sub-Pool	Ind	Sub	Super 1	Super 2	Final 1	Final 2
1	1	0.1	0.1	0.3	0.01	0.03
	2	0.2	0.1	0.3	0.02	0.06
	3	0.3	0.1	0.3	0.03	0.09
	4	0.4	0.1	0.3	0.04	0.12
2	5	0.1	0.2	0.4	0.02	0.04
	6	0.2	0.2	0.4	0.04	0.08
	7	0.3	0.2	0.4	0.06	0.12
	8	0.4	0.2	0.4	0.08	0.16
3	9	0.1	0.3	0.1	0.03	0.01
	10	0.2	0.3	0.1	0.06	0.02
	11	0.3	0.3	0.1	0.09	0.03
	12	0.4	0.3	0.1	0.12	0.04
4	13	0.1	0.4	0.2	0.04	0.02
	14	0.2	0.4	0.2	0.08	0.04
	15	0.3	0.4	0.2	0.12	0.06
	16	0.4	0.4	0.2	0.16	0.08

And/or

Pool	Contribution
Sub	0.1
Sub	0.2
Sub	0.3
Sub	0.4
Super	0.01
Super	0.02
Super	0.03
Super	0.04
Super	0.06
Super	0.08
Super	0.09
Super	0.12
Super	0.16

L132-140. Present a table with details of the clusters.
Add "Hence," before "We".
L150-151 vs. L187-189. 10, 40 and 50 sub samples have different number of SNPs and the order of the presented subsamples is reversed (not user friendly).
L156-158 and later on. Note that normalized to sum of 1 is simply the proportion of the animal in the pool.
Add references after "intact" and "IBD".
L171 and Figure 2. Impossible to judge. Present a table with both contributions and variances.
L171-172. Doesn't the sentence starting with "Variation" repeat the previous sentence?
L172-176. Present a detailed table of the analysis of the clusters.
Why is the reference Peiris presented here and not in the References? Why is it attached to Figure 3? Does this figure come from Peiris? Or should one see that reference for details? Or what?
L182-184. Present a table and/or a graph showing in detail the effect of the number of SNPs on the technical error.
Change "ibd" to "IBD".
Figure 1.
- Shift to the right the proportional contribution of individuals 9-10.
- Add "presented above the individual number" after "tissue".
Figure 2. Suggest 0.4x0.4.
Figure 3.
- What are the genotype series? Individuals? Population frequencies? (See above comment on within genotype frequency).
- Contrary to the legend, there are no lines for the genotype series.
- Maximum X and Y should be 1.0.
- At least in this marker, the pools overestimate B, always overestimate. An unbiased estimate should be distributed above and below the slope. Are the pools really accurate?
- Present numeric values on the plot and/or in a table.
Change "or" to "expressed as".
L221-222. Delete "number of copies of the B allele for individual genotypes or allele".
L223-225. The red dashed line.
- As said above, B Allele frequency for heterozygote = 0.5, hence the red dashed line must be population frequency. Is it?
- What is the diagonal segment at the beginning of the line?
- How does this line relate to the pools?
Figure 4.
- What are the lines from the upper left? They are not mentioned in the text or presented in the legend.
- What are the frames around the colored bands? SD?
- Who are the 3 groups of Holsteins? Specifically, I can't find any mention of the 38 and 16.
L235-237. The reciprocal relation between animal contribution and the size of the pool (number of animals) should be presented at the start of this paper.
Add a period at the end.
Change period to comma after "individuals".

Author Response

Molecular and statistical methods are sound and the conclusions are valid, albeit unnecessarily complicated.

In addition to unnecessarily complicating the analyses, the authors tend to complicate the text. It is very wordy, with redundant repeats and very long, nearly unreadable sentences (for example, the sentence starting with "furthermore…" in Lines 29-35 is endless). They discuss proportional contribution while actually measuring it by allele frequency in pools and individuals (as others did before). Author response: These lines are now divided into three sentences. We are actually writing about variable effects of loci on phenotypes in different parts of the genome. We clarified this.

No attention is given to the difference between sub and super pools. The number of individuals in a pool was different, and so were the construction methods. Tissue weights were used to construct the small sub-pools, while extracted DNA was used to construct the larger super pools. To start with, the difference between abscessed and normal tissue may affect DNA content and extraction efficacy, and hence variation of contribution. Author response: We added a Levene test for equality of variance in pool construction error between tissue-based and DNA-based pooling (P ≤ 0.045). The liver tissue that we used wasn't specifically from within the abscess but, as you say, it is possible that the abscess affects the integrity of the liver and postmortem changes within it.
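As an illustration of the kind of comparison described in this response, the following is a minimal sketch of a Levene-type statistic written with the Python standard library only. The function name `levene_w`, the choice of the group mean as the center, and all of the numbers are hypothetical; they are not the authors' data or their implementation.

```python
from statistics import mean


def levene_w(groups, center=mean):
    """Levene-type statistic for equality of variances across groups.

    Each observation is replaced by its absolute deviation from the group
    center, and a one-way ANOVA F statistic is computed on those deviations.
    Passing statistics.median as `center` gives the Brown-Forsythe variant.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    deviations = []
    for g in groups:
        c = center(g)
        deviations.append([abs(x - c) for x in g])
    group_means = [mean(d) for d in deviations]
    grand_mean = mean([z for d in deviations for z in d])
    between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, group_means)
    )
    within = sum(
        (z - m) ** 2 for d, m in zip(deviations, group_means) for z in d
    )
    return (n_total - k) / (k - 1) * between / within


# Hypothetical construction errors (observed minus planned contribution).
tissue_based = [0.031, -0.042, 0.055, -0.018, 0.047, -0.060, 0.025, -0.038]
dna_based = [0.004, -0.006, 0.009, -0.003, 0.007, -0.008, 0.002, -0.005]

w = levene_w([tissue_based, dna_based])
df_within = len(tissue_based) + len(dna_based) - 2
print(f"W = {w:.2f} on (1, {df_within}) degrees of freedom")
```

Turning W into a P value requires the F distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.levene`) would normally be used for that step.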

No tables! A lot of numeric data in the text, no detailed numeric data in the figures. It seems that the author(s) don't like tables… Author response: We added tables, which reduces the complexity of the text.

Specific comments

Suggest subtitles for the various sections. We added subtitles to the Results section: Pool construction error, Technical error, Proportionality of animal contributions conserved with dilution, and Identity by descent sharing.
Supplemental data are not mentioned. They should be, in Lines 75-82, for example. We added a reference here to the raw data, which are at Ag Data Commons.
Add "to" after compared. Changed.
Add "to the planned contribution," before "using". Changed.
L22-23. Not clear. Reword. We clarified the distinction between B allele frequency (or frequency of the B allele) and theta, which is the measure of pooling allele frequency in this study. We chose to do this because Illumina uses theta to call genotypes and to quantify copy number. Estimating pooling allele frequency is statistically similar to estimating copy number.
"GWAS" already include "study"… We removed “study”.
L64-65. Note that pooling errors can include also weighing error, different DNA content in the tissue, and even recording errors (e.g., mixing samples, recording wrong DNA concentration). We made this change.
L67-68. Not clear. Does "technical error" refer to the estimate of the population by the pools, or to the contribution to the pools? Technical error is the result of variation in theta among replicated arrays for the same animal, or for a pool of the same animals in the same proportions. We do not replicate arrays because it is less expensive to estimate technical error by bootstrapping or down-sampling SNP, that is, by computing the variance among animal contributions estimated from sub-samples of SNP sampled without replacement.
Add "in this study". Accepted.
L89 and further, and Figures 1, 2. The text discusses "super pools", while "pools" are presented in these figures. We made the pool terminology consistent in the figures and text.
L93-94. Define "n". 0.1 g tissue? We rewrote this and no longer use n. n is the number of planned contributions with this value.
L94-95. What was the source of the individuals' DNA? Liver. We clarified.
Figure 1 present both sub- and super pools. Add "to the Sub-Pools," before "ranging", and ", and varying from 0.01 to 0.16 in the super pools" after ")". Accepted.
L110-112 and later on. Not clear. What is "variation in B allele frequency within genotype"? By definition, Allele B can only be 0, 0.5 and 1, within genotypes AA, AB and BB. Can't be any variation, by definition. There could be variation in the population though. Is that what you mean? We use θ = 2 * tan^-1(Y/X) / π for pooling allele frequency. θ does vary within genotype or among technical reps within a pool (same animals in same proportions). Illumina uses B allele frequency and θ as synonyms, and we agree that it is confusing. However, we do think it is a good idea to standardize and minimize multiple terms for the same thing, which is why we chose to use Illumina's θ. We refer to θ as a proxy for pooling allele frequency, which is still confusing but seems to be the best option. Illumina uses θ for calling genotypes and estimating copy number, so it is in wide use.
In Figure 2 too you present sub- and super pools together. Add, "Pooling sub and super pools together," before "There". Accepted.
L129-130, 137. The subject is the number of unique values among all planned contributions. The number of contributions of the animal is presented in Line 137. Move "4 super pools * 16 animals per super pool + 8 sub-pools * 4 animals per sub-pool" to L137 instead of "13 distinct values", and change here to "4 unique values among the Sub-Pools + 9 unique values among the Super-Pools". Accepted.
Show the calculation of the contributions and add a table to show the values. Something like:

Sub-Pool	Ind	Sub	Super 1	Super 2	Final 1	Final 2
1	1	0.1	0.1	0.3	0.01	0.03
	2	0.2	0.1	0.3	0.02	0.06
	3	0.3	0.1	0.3	0.03	0.09
	4	0.4	0.1	0.3	0.04	0.12
2	5	0.1	0.2	0.4	0.02	0.04
	6	0.2	0.2	0.4	0.04	0.08
	7	0.3	0.2	0.4	0.06	0.12
	8	0.4	0.2	0.4	0.08	0.16
3	9	0.1	0.3	0.1	0.03	0.01
	10	0.2	0.3	0.1	0.06	0.02
	11	0.3	0.3	0.1	0.09	0.03
	12	0.4	0.3	0.1	0.12	0.04
4	13	0.1	0.4	0.2	0.04	0.02
	14	0.2	0.4	0.2	0.08	0.04
	15	0.3	0.4	0.2	0.12	0.06
	16	0.4	0.4	0.2	0.16	0.08

And/or

Pool	Contribution
Sub	0.1
Sub	0.2
Sub	0.3
Sub	0.4
Super	0.01
Super	0.02
Super	0.03
Super	0.04
Super	0.06
Super	0.08
Super	0.09
Super	0.12
Super	0.16

We appreciate Reviewer 1 providing this example. We added a similar table which is placed close to where Figure 1 is discussed.
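
The arithmetic behind the reviewer's example tables can be sketched as follows. This is only an illustration of that example: the within-sub-pool proportions and the "Super 1" and "Super 2" weights are copied from the reviewer's table, while the names `SUB`, `SUPER` and `final_contributions` are hypothetical. Exact fractions are used so that repeated values such as 0.04 are recognized as identical.

```python
from fractions import Fraction

# Planned within-sub-pool proportions for the four animals of a sub-pool
# (column "Sub" of the reviewer's example table).
SUB = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]

# Planned proportion of sub-pools 1 to 4 in two super pools
# (columns "Super 1" and "Super 2" of the example table).
SUPER = {
    "Super 1": [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)],
    "Super 2": [Fraction(3, 10), Fraction(4, 10), Fraction(1, 10), Fraction(2, 10)],
}


def final_contributions(sub, super_weights):
    """Return one row per animal: (sub_pool, animal, final contribution)."""
    rows = []
    animal = 0
    for pool_index, pool_weight in enumerate(super_weights, start=1):
        for within in sub:
            animal += 1
            # Final contribution = proportion within the sub-pool
            # times the proportion of that sub-pool in the super pool.
            rows.append((pool_index, animal, within * pool_weight))
    return rows


final_1 = final_contributions(SUB, SUPER["Super 1"])
final_2 = final_contributions(SUB, SUPER["Super 2"])

# Contributions within a super pool sum to 1.
assert sum(c for _, _, c in final_1) == 1
assert sum(c for _, _, c in final_2) == 1

unique_sub = sorted(set(SUB))
unique_super = sorted({c for _, _, c in final_1 + final_2})

for pool, animal, c in final_1:
    print(pool, animal, float(c))
print(len(unique_sub), "unique sub-pool values,",
      len(unique_super), "unique super-pool values")
```

The counts of unique values produced this way correspond to the "4 unique values among the Sub-Pools + 9 unique values among the Super-Pools" wording suggested in the comment on L129-130.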

L132-140. Present a table with details of the clusters. Accepted.
Add "Hence," before "We". Accepted.
L150-151 vs. L187-189. 10, 40 and 50 sub samples have different number of SNPs and the order of the presented subsamples is reversed (not user friendly). We now present these results in a table.
L156-158 and later on. Note that normalized to sum of 1 is simply the proportion of the animal in the pool. We are actually looking at the proportion of the animals in the sub-pool after diluting into the super pool. We take your point, though, and use it to simplify our writing: we don't need to write redundantly that values are normalized so the sum is 1, because we are using the word proportion. The point we are trying to make is that the proportionality of the animals in the sub-pool is conserved following dilution of the sub-pool into the super pool.
Add references after "intact" and "IBD". We added references to support use of Beagle and hap-ibd for this purpose.
L171 and Figure 2. Impossible to judge. Present a table with both contributions and variances. The reviewer is correct that rigorous inference is not possible from the table. However, the Levene test evaluates the relationship between planned contributions and pool construction error. We hope the clarification of the Levene testing helps with this confusion.
L171-172. Doesn't the sentence starting with "Variation" repeat the previous sentence? Yes. We removed the redundancy.
L172-176. Present a detailed table of the analysis of the clusters. Accepted.
Why is the reference Peiris presented here and not in the References? Why is it attached to Figure 3? Does this figure come from Peiris? Or should one see that reference for details? Or what? Peiris et al. wrote about normalizing pooling allele frequency using the difference in X/Y ratio for heterozygotes. Basically, variation in X/Y within and between SNP is the source of technical error. Our goal in this paper was to show that by utilizing multiple SNP we can overcome this technical error in X/Y, but it takes multiple SNP. Figure 3 shows variation in θ around the allele frequency line from individual genotypes and planned contributions, as well as for individual genotypes (0, 0.5, and 1). We changed the X and Y labels to reflect different, more consistent terminology.
L182-184. Present a table and/or a graph presenting in details the effect of the number of SNPs on the technical error. Accepted.
Change "ibd" to "IBD". Accepted.
Figure 1.

Shift to the right the proportional contribution of individuals 9-10. Accepted.
Add "presented above the individual number" after "tissue". Accepted.

Figure 2. Suggest 0.4x0.4. We assume this refers to the maximum X and Y. Some Y values are greater than 0.4, so we will use 0.5x0.5.
Figure 3.

What are the genotype series? Individuals? Population frequencies? (See above comment on within genotype frequency). The genotype points come from individual animals; Y values are θ and X values are copies of the B allele, or genotypes coded 0, 0.5, and 1. For pools, the X values are weighted average genotypes (0, 0.5 and 1) using planned contributions, and the Y values are θ. These results are all for one SNP. I didn't want to show more than one figure for this. Some SNP are closer to the line (X/Y for heterozygotes near 1) and others fall below the line. We added 2 SNP with an a and b figure.
Contrary to the legend, there are no lines for the genotype series. Lines have been removed.
Maximum X and Y should be 1.0. True.
At least in this marker, the pools overestimate B, always overestimate. An unbiased estimate should be distributed above and below the slope. Are the pools really accurate? We add a reference to Peiris et al. here because there is variation above and below the line. We add a three-part figure (a, b and c): one SNP above the line, one on the line and one below. We want to show that this happens at the individual SNP level, but for multiple-SNP results for individuals (a pool of 2 haplotypes) or individual haplotypes, this within- and between-SNP variation gets normalized by the central limit theorem and multiple SNP.
Present numeric values on the plot and/or in a table. That would be too complicated with the change of adding 2 additional SNP.

Change "or" to "expressed as". Accepted.
L221-222. Delete "number of copies of the B allele for individual genotypes or allele".
L223-225. The red dashed line.

As said above, B Allele frequency for heterozygote = 0.5, hence the red dashed line must be population frequency. Is it? It might be. That value came from the Illumina datasheet that has been, and is being, used to create clusters to call genotypes based on θ. It will change if there is a copy number variant specific to a population, or neighboring SNP that interfere with hybridization to the probe fragments on the bead. I'm just trying to show that the value from Illumina is overestimated both for heterozygotes and for pools; the error is consistent. This is the basis on which many people adjust pooling for X/Y of heterozygotes. Using multiple SNP normalizes this error.
What is the diagonal segment at the beginning of the line? I don't see a diagonal segment. It may be an artifact of file transfer, but I'm not sure.
How does this line relate to the pools? For this marker, mean X/Y is greater than 1 and mean θ is greater than 0.5, which is consistent with what the pools are doing. Peiris et al. is right in this case. However, X/Y adjustment is unnecessary and contributes to error if the purpose of pooling is estimation of haplotype or individual contributions, because these entities are estimated using multiple SNP. We clarify these points in the discussion.
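
The definition θ = 2 * tan^-1(Y/X) / π quoted earlier in this response can be written as a small function. This is a sketch under stated assumptions, not Illumina's software: the name `illumina_theta` is hypothetical, X and Y are assumed to be non-negative normalized intensities for the A and B alleles respectively, and `math.atan2` is used in place of tan^-1(Y/X) only so that X = 0 does not cause a division by zero.

```python
import math


def illumina_theta(x, y):
    """Map X and Y intensities to theta in [0, 1].

    theta = 2 * atan(Y / X) / pi. For non-negative inputs, atan2(y, x)
    equals atan(y / x) and also covers the case x == 0.
    """
    if x < 0 or y < 0:
        raise ValueError("intensities are assumed to be non-negative")
    if x == 0 and y == 0:
        raise ValueError("theta is undefined when both intensities are zero")
    return 2.0 * math.atan2(y, x) / math.pi


# Equal intensities give theta = 0.5; theta moves toward 1 as Y grows
# relative to X and toward 0 as X grows relative to Y.
for x, y in [(1.0, 0.0), (1.0, 1.0), (1.0, 1.2), (0.0, 1.0)]:
    print(x, y, round(illumina_theta(x, y), 3))
```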

Figure 4.

What are the lines from the upper left? They are not mentioned in the text or presented in the legend. I do not see those in the version I am working with, which came back from Agriculture. I think it might be a Mac/PC incompatibility in how the png file coming from R is interpreted. I work with some Mac users and will make sure this doesn't happen on my end.
What are the frames around the colored bands? SD? 25% and 75% quantiles.
Who are the 3 groups of Holsteins? Specifically, I can't find any mention of the 38 and 16. The green results use all of the animals outside the pool (which is why there is no variation), the black use half of the animals outside the pool, and the red use 25% of the animals outside the pool. The red and black results vary because they are bootstrapped, so each is a different sample of ½ or ¼ of the animals. I still think this is a good question, but there is probably not enough space to explain it adequately. Basically, if there are fewer animals in the haplotype reference database, then there are fewer haplotypes to match to; a smaller set of haplotypes results in less matching and less IBD. We omitted Figure 4 in the revision and present data using the full 718 animals from hapmap, which includes other breeds as well.

L235-237. The reciprocal relation between animal contribution and the size of the pool (number of animals) should be presented at the start of this paper. Accepted.
Add a period at the end. Accepted on L260.
Change period to comma after "individuals". Accepted on L264.
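
The down-sampling idea described in the response on technical error (variance among animal contributions estimated from sub-samples of SNP drawn without replacement) can be illustrated with a toy simulation. Everything here is an assumption made for illustration: a pool of only two animals, genotypes drawn uniformly from 0, 0.5 and 1, Gaussian measurement noise, a through-the-origin least-squares estimator, and the function names. It is not the authors' estimator or data.

```python
import random
from statistics import variance


def simulate_pool(n_snp, contribution, noise_sd, rng):
    """Simulate two animals' genotypes and a noisy pooled allele frequency."""
    snps = []
    for _ in range(n_snp):
        g1 = rng.choice([0.0, 0.5, 1.0])
        g2 = rng.choice([0.0, 0.5, 1.0])
        pooled = (contribution * g1 + (1.0 - contribution) * g2
                  + rng.gauss(0.0, noise_sd))
        snps.append((g1, g2, pooled))
    return snps


def estimate_contribution(snps):
    """Least-squares estimate of animal 1's contribution from a set of SNP."""
    # pooled - g2 = c * (g1 - g2) + error: regression through the origin.
    num = sum((g1 - g2) * (p - g2) for g1, g2, p in snps)
    den = sum((g1 - g2) ** 2 for g1, g2, _ in snps)
    return num / den


def technical_variance(snps, n_subsamples, rng):
    """Variance of the estimate across disjoint SNP sub-samples."""
    shuffled = snps[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // n_subsamples
    estimates = [
        estimate_contribution(shuffled[i * size:(i + 1) * size])
        for i in range(n_subsamples)
    ]
    return variance(estimates)


rng = random.Random(2021)
snps = simulate_pool(n_snp=20000, contribution=0.3, noise_sd=0.05, rng=rng)
var_small = technical_variance(snps, n_subsamples=200, rng=rng)  # 100 SNP each
var_large = technical_variance(snps, n_subsamples=10, rng=rng)   # 2000 SNP each
print("variance with 100 SNP per sub-sample: ", var_small)
print("variance with 2000 SNP per sub-sample:", var_large)
```

In this toy setting the variance among sub-sample estimates shrinks as each sub-sample contains more SNP, which is the qualitative pattern the reviewer's summary attributes to the paper.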

Author Response File: Author Response.docx

Reviewer 2 Report

Author Response

Genetic assessment of individuals and populations of animals and plants is an effective method for us to understand and utilize them. However, the acquisition of some DNA samples is limited, so we try to optimize some methods to reduce the amount of DNA samples without affecting the accuracy of DNA detection results. Mixing individual DNA samples is an effective way to reduce the amount of DNA used. Author response: Pooling in our context is meant to reduce the cost of genotyping and DNA extraction when that is possible. We don't necessarily use less DNA. Generally speaking, this approach is useful when phenotypes are inexpensive (for example, when they can be collected from a processing plant), sampling is inexpensive, and the DNA content of tissue isn't too variable, so that it is possible to avoid extracting DNA from samples individually. In this study, the quality of the DNA mixing pool was evaluated by some technical methods and some meaningful results were obtained. Author response: Yes. The realized or estimated animal contribution was important to us in this study, and we evaluated the influence of errors in measurement and in forming pools. The paper requires minor revisions.
What is the basis for using abscessed livers and non-abscessed livers as objects of study? Only 16 animal samples were used in this paper, which is a small number. Author response: These samples were collected as part of a larger project studying the genetic basis underlying risk of liver abscess. We treat abscessed and normal livers as replicates in this study because numbers are not sufficient to capture and utilize disease association. The small animal contribution of 1% is the same as an individual animal's contribution when 100 animals are pooled with equal representation. Also, the work required individual genotyping, and we wanted to avoid the cost of genotyping large numbers of animals to study large pool sizes. Hence, the experiment was designed to evaluate the influence of pool size on error by pooling animals with varying proportions.
The number of SNP detection results proposed in this paper should be presented in a table. Author response: Accepted.

Round 2

Reviewer 1 Report

In the abstract and throughout the text, emphasize the inverse relation between individual contribution and the number of individuals in the pool (the size of the pool). Larger contribution → smaller pool → smaller sample of individuals, and vice versa.
Suggest "cluster/ed" instead of "lump/ed".
Suggest to use "Illumina θ" instead of "θ".
L130. Add "s" after "Table".
Table A2. Keep the order of work and text: first Sub-Pools, then Super.
L146. Add "(Table A2)" after "%".
L289. Add ")" after "Table 1".
L291. Add a comma after ")".
L294. Add a comma after "granularity".
"Table 2", isn't it?
746 animals? Explain this calculation. No 718 nor 4 are mentioned before. 718 animals from 18 diverse breeds (L528)? hapmap animals? 4 crossbred beef females (L387)?
718 + 32 - 16 = 734 (and L530).
L386-384. What is the last sentence referring to? The calculation above it?
Change period to comma after "individuals".
To authors reply 2. I meant quote in the text.
To authors reply 22. Since you deleted the relevant sentences, no need to add references.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
