Adjusted Sample Size Calculation for RNA-seq Data in the Presence of Confounding Covariates
Round 1
Reviewer 1 Report
The authors present an approach to compute sample sizes and power for RNA-seq based differential analysis. The novelty of the simulation approach is the addition of covariate information when computing the power. The manuscript is well written, the different simulation scenarios are interesting, and the differences are clearly explained.
Major:
- The authors provide as supplementary information an R script with examples for some of the simulations and the data for the COAD analysis. It would be great if the authors added another R script that can be used to directly conduct the power analysis on the real data they provide. This would help other researchers apply the method to their own data and thus increase the chance of real use. Currently there is only this simple R script, with no explanation of how to use it on the real data; additional explanation of usage is also encouraged.
Minor:
The manuscript should be spell-checked again; here are some typos:
line 135: .. bars SHOW that ...
line 264: ... and A 0.8 nominal power ...
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Review of manuscript biomedinformatics-1175247 entitled “Covariate-adjusted Sample Size and Power Calculations for RNA-seq Data”
This paper proposes a method to estimate the number of replicates n required for each condition of an RNA-seq experiment to achieve fixed type I and type II error rates, in the presence of confounding covariates.
Major issues
- The authors should very briefly discuss, in the Introduction, the difference between independent covariates and confounding covariates, the former being controlled by the experimental design, the latter not. They should then also highlight the purpose of the paper, which is to introduce confounding covariates for sample size estimation. Indeed, this is not clear to me at the very beginning of the manuscript. Even the title should perhaps be “Adjusted sample size and power calculations for RNA-seq data in the presence of confounding covariates”, shouldn’t it? Perhaps, also, use only the term “confounder-adjusted” instead of “covariate-adjusted”? Or explain at the very beginning that, by abuse of language, both terms refer to confounders?
- Among the methods cited in the manuscript for sample size calculation, are there no methods dealing with independent covariates? If some methods do so, they could also be used for confounding covariates, couldn’t they? If that is the case, shouldn’t the authors compare their results to this kind of benchmark method?
- On page 3, lines 103-104, the authors say that “The actual power calculated from the simulation is shown on each bar graph with a minimum 0.8.”. In my opinion, the power is not really calculated in this study. It is just a threshold to be reached in the procedure for calculating n, as described in Materials and methods (page 18, lines 433-435). The “actual” power therefore gives us no information, because it is always about 0.80 or very slightly larger. It should be removed from the figures, shouldn’t it? Moreover, in my opinion, it should also be removed from the title, which would become “Adjusted sample size calculation for RNA-seq data in the presence of confounding covariates”?
- At page 15, lines 311-313, the authors underline that “As an alternative, a higher read depth sequencing may be chosen to increase the mean read counts for each sample, instead of directly increasing the sample size.” as it is shown in Lamarre et al. (2018). But, what about the library size (or depth) of samples in the current simulations? Is it taken into account? In my opinion, that should be underlined in the manuscript.
- Page 7, lines 154-159. The authors underline that “different distribution of covariates between the two groups such as a Poisson distribution with mean and variance of 10 and a normal distribution N_1(12,1) with mean of 12 and variance of 1 require a larger n compared with the same distribution of the covariate (Pois_0(10) and Pois_1(12)) or different distribution of the covariate with same variance Pois_0(10) & N_1(12,10).” In my opinion, that is explained by the fact that confounding covariates with very high variances (compared to the mean) are not really confounding, whatever their laws are, because they do not really discriminate enough between the two groups. Do the authors agree?
- For the real data analysis using COAD RNA-seq data, why do the authors assume that there are 500 genes that are DEGs (page 14, line 259)? Why didn’t they use all the RNA-Seq data to estimate this number of “true” DEGs? Could this value have an impact on the results? Please underline that in the manuscript.
- The authors argue that the method they describe is very useful for sample size calculations when confounding variables exist. But how can we use it explicitly? Could the authors give users a kind of step-by-step procedure? Moreover, for the COAD data, for example, how would they choose n based on Table 2? The mean, the median, the maximum value? I really think that the analysis is interesting for better understanding confounding variables, but is it really useful in practice?
- I think that there are too many typos and spelling mistakes. The manuscript should be read again carefully. See hereafter for some of them.
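The remark above about power acting as a stopping threshold rather than a computed result can be sketched in a few lines. This is a minimal illustration only, not the authors' simulation procedure: it assumes a two-sided, two-sample test under a normal approximation, and the function names `approx_power` and `smallest_n` as well as the standardized effect sizes are hypothetical choices made for the example.

```python
from statistics import NormalDist

def approx_power(n, effect, sd, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n per group.

    Assumption: normal approximation; the negligible lower-tail
    rejection probability is ignored.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n) ** 0.5  # standard error of the difference in means
    return NormalDist().cdf(abs(effect) / se - z_crit)

def smallest_n(effect, sd, target=0.8, alpha=0.05, n_max=10_000):
    """Return (n, achieved power) for the first n whose power reaches the target."""
    for n in range(2, n_max + 1):
        power = approx_power(n, effect, sd, alpha)
        if power >= target:
            return n, power
    raise ValueError("target power not reached within n_max")

# Hypothetical standardized effect sizes, chosen only for illustration.
for effect in (0.5, 0.8, 1.0):
    n, power = smallest_n(effect, sd=1.0)
    print(f"effect={effect}: n={n}, achieved power={power:.3f}")
```

Because n is increased one unit at a time until the target is met, the achieved power necessarily lands at or just above 0.8 in every scenario, which is why reporting it alongside n adds little information.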
Minor issues
- Page 2, line 92. Parameters Psy_1 and Phi should perhaps be briefly described here to be understandable without reading the Materials and methods?
- The title of Table 1 is not clear. This table does not give the “Summary of the calculated sample size n and actual power for Figures 1 to 8.”, does it? It shows the simulation characteristics, no?
- Other parameter to add in Table 1: the power at 0.80?
- The authors should perhaps comment in the Results why the values of n are not similar for a fold change of 0.5 and of 2? It’s simply because there is no symmetry between laws in H_0 and in H_1, isn’t it?
- For categorical variables, for example on page 7, lines 160-169, and in the figures, it is not clear what, for example, (.1,.4,.4,.1) stands for. What is the proportion for H_0? For H_1? That should be written clearly.
- Page 16, lines 350-352. Why did the authors write these sentences? I don’t understand what they want to say, or why they calculate the mean of the size factors. In my opinion, Equation (3) ends the sentence, doesn’t it?
- Page 17, line 398. The authors say that w is set to 1 for an equal read depth and different from 1 for unequal read depth. But the size factor is not only related to the library size (cf. Maza et al., 2013).
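On the minor issue above asking why n differs between fold changes of 0.5 and 2, a rough log-scale approximation makes the asymmetry visible. The sketch below is not the manuscript's method: it assumes negative binomial counts with variance mu + phi * mu^2, uses the delta-method variance (1/mu + phi)/n for the log of a sample mean, and ignores covariates and size factors. The function name `n_per_group` and the values mu_0 = 10 and phi = 0.1 are illustrative assumptions.

```python
from math import ceil, log
from statistics import NormalDist

def n_per_group(mu0, fold_change, phi, alpha=0.05, power=0.8):
    """Approximate replicates per group to detect a fold change.

    Assumptions: negative binomial counts with Var = mu + phi * mu**2,
    a Wald-type test on the log fold change, no covariates.
    """
    mu1 = mu0 * fold_change
    # Delta method: n * Var(log(sample mean)) is about 1/mu + phi in each group.
    var_sum = (1 / mu0 + phi) + (1 / mu1 + phi)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * var_sum / log(fold_change) ** 2)

# Same absolute log fold change, different treatment-group means.
for fc in (0.5, 2):
    print(f"fold change {fc}: n per group = {n_per_group(10, fc, 0.1)}")
```

Under these assumptions log(0.5)^2 equals log(2)^2, so the only difference is the 1/mu_1 term: halving the mean lowers the counts in the treatment group and raises their relative variance, which calls for a larger n. For very large mu_0 the term vanishes and the two fold changes give the same n.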
Typos and spelling mistakes
- Page 1, line 22. The terms “covariate-adjusted sample size into account” should be replaced by “confounding covariates into account”, shouldn’t they?
- Page 2, line 81. Replace “i.e,” by “i.e.”.
- Page 3, line 103. There is a “u_0” instead of a “mu_0”. The same mistake appears many times in the manuscript.
- The authors write for example “N_0 & N_1” and sometimes “N_0 and N_1”. Could they homogenize this?
- The Greek letter phi appears sometimes in lowercase and sometimes in uppercase in the manuscript and in the figure legends. All occurrences should be the same, shouldn’t they?
- Page 4, line 123. The Greek letter rho in Figure 1 is not correct. And also in the other figures.
- Page 3, line 101. Replace “.2.1” by “2.1”.
- Page 6, line 140. It should be “effect” instead of “affect”, twice?
- Page 6, line 137. Replace “N_0(30, 5^2) and N_1(40,5^2)” by “N_0(30, 3^2) and N_1(40,3^2)”? And, in line 139, replace “N_1(40,5^2)” by “N_1(42,5^2)”? The whole manuscript should be read again to avoid such typos.
- Page 9, line 198. The authors write “significance level of size alpha” but they should simply write “significance level alpha”, shouldn’t they? And that should be changed all along the manuscript?
- Page 14, line 265. We should read “the expected significantly DEG” instead of “the expected true DEG”?
- Page 16, line 334. Replace “liner” by “linear”.
- The equations on page 16 should be checked again. For instance, in line 353, is the Z_i missing? Should gamma_ij be gamma_i? Should L_ij be L_i? Should X_ij be X_i? And so on.
- Page 18, line 417. Why is the alpha in italics and bold? It appears all along the page.
References
Lamarre S, Frasse P, Zouine M, Labourdette D, Sainderichin E, Hu G, Le Berre-Anton V, Bouzayen M, Maza E. Optimization of an RNA-Seq Differential Gene Expression Analysis Depending on Biological Replicate Number and Library Size. Front Plant Sci. 2018 Feb 14; 9:108.
Maza E, Frasse P, Senin P, Bouzayen M, Zouine M. Comparison of normalization methods for differential gene expression analysis in RNA-Seq experiments. Communicative & Integrative Biology. 2013; 6:6.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Review of manuscript biomedinformatics-1175247 entitled “Adjusted Sample Size Calculation for RNA-seq Data in the Presence of Confounding Covariates”
Almost all points of the first review have been tackled correctly. Nevertheless, three remarks follow.
Remark 1:
Point (2) of the original review, repeated below, has not been addressed by the authors. They should look at it.
“2) Among the methods cited in the manuscript for sample size calculation, are there no methods dealing with independent covariates? If some methods do so, they could also be used for confounding covariates, couldn’t they? If that is the case, shouldn’t the authors compare their results to this kind of benchmark method?”
Remark 2:
Concerning the following point and response:
“(4) Page 7, lines 154-159. The authors underline that “different distribution of covariates between the two groups such as a Poisson distribution with mean and variance of 10 and a normal distribution N_1(12, 1) with mean of 12 and variance of 1 require a larger n compared with the same distribution of the covariate (Pois_0(10) and Pois_1(12)) or different distribution of the covariate with same variance Pois_0(10) & N_1(12, 10).” In my opinion, that is explained by the fact that confounding covariates with very high variances (compared to the mean) are not really confounding, whatever their laws are, because they do not really discriminate enough between the two groups. Do the authors agree?
Answer: Thank you for the comments.
We feel we did not write it clearly. We think a high difference in variance can affect the sample size given an unequal mean between two conditions. We observed a larger n estimated from the Pois_0(10) & N_1(12, 1) confounder than from the Pois_0(10) & N_1(12, 10) confounder, and this is because the former has a variance of 10 in the control group and a variance of 1 in the treatment group. But the latter has the same variance of 10 for the control and treatment groups (page 7, lines 173-174).”
I do not totally agree with the authors. Indeed, I don’t think that a “high difference in variance can affect the sample size”. I think that high variances themselves can affect the sample size. More precisely, I think that we should observe a larger sample size n estimated from N_0(10,1) & N_1(12, 1) than from Pois_0(10) & N_1(12, 1), and in turn a larger n from the latter than from Pois_0(10) & N_1(12, 10), i.e. N_0(10,1) & N_1(12, 1) > Pois_0(10) & N_1(12, 1) > Pois_0(10) & N_1(12, 10), shouldn’t we? Then n should be bigger for small variances, whether or not they are similar, shouldn’t it?
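The ordering proposed in this remark can be checked with a simple linear-model heuristic. The sketch below is an approximation under stated assumptions, not the negative binomial model of the manuscript: it assumes equal group sizes and a single covariate, and computes the variance inflation factor 1/(1 - r^2) of the group effect, where r is the correlation between the group indicator and the covariate. The function name `variance_inflation` is hypothetical. Only the group means and variances of the covariate enter the calculation, so its distributional family plays no role.

```python
def variance_inflation(mean0, var0, mean1, var1):
    """Variance inflation of the group effect after adjusting for one covariate.

    Assumptions: linear model, equal group sizes, covariate summarized
    by its mean and variance in each group.
    """
    within = (var0 + var1) / 2           # pooled within-group variance
    between = (mean1 - mean0) ** 2 / 4   # variance explained by group membership
    r2 = between / (within + between)    # squared correlation with the group indicator
    return 1 / (1 - r2)

# Poisson(10) has mean 10 and variance 10.
scenarios = {
    "N_0(10,1) & N_1(12,1)": (10, 1, 12, 1),
    "Pois_0(10) & N_1(12,1)": (10, 10, 12, 1),
    "Pois_0(10) & N_1(12,10)": (10, 10, 12, 10),
}
for label, params in scenarios.items():
    print(f"{label}: inflation = {variance_inflation(*params):.3f}")
```

Under this heuristic the inflation decreases as the within-group variances grow relative to the mean difference, giving the same ordering as the inequality stated in the remark: small variances make the covariate separate the groups more sharply, which costs more precision on the group effect and therefore more samples.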
Remark 3.
I do not really understand what the following added sentence means: “But it can be estimated from the expected number of DEGs with normalized read counts.” (page 15, lines 344-345). The authors should rewrite it.
Author Response
Thank you very much for taking the time to help us revise our manuscript despite your busy schedule. Thank you so much for all the insightful comments and suggestions. We really appreciate them. By the way, we like the title you suggested.
Please find the attached file. We address your remarks on a point-by-point basis and have made minor changes.
Once again, thank you very much!
Author Response File: Author Response.pdf