Article
Peer-Review Record

Revisiting the Replication Crisis and the Untrustworthiness of Empirical Evidence

by Aris Spanos
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Submission received: 1 April 2025 / Revised: 28 April 2025 / Accepted: 15 May 2025 / Published: 20 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I appreciate the invitation to review this paper. I am a biostatistician who specializes in biomedical research analytics with a focus on population-level assessments of health-related topics. This paper has important links to my work and resonates with aspects of my professional experience. However, although I believe the main thesis of this article is an important argument, I am not sure it is presented in the best way.

I interpret the main argument of this paper as being that replicability is at the mercy of sloppy, cookie-cutter pipelines that overlook the nature of the assessments. Anyone experienced in applied statistics eventually comes to understand that what matters is not knowing how to run a test but knowing when it is appropriate to run it. Following that idea, there is also the subsequent understanding that these tests will ultimately be interpreted by many who neither understand nor care about the appropriateness of these assessments.

With these points in mind, my main concern with this paper is that I am not sure who it is aimed at. I tell my students to pay attention to how they tell the story: understanding the people who need to hear the story, so the narrative can be tailored to them, is as important as the message itself. If this paper is aimed at a broader audience, this may not be the best approach. If it is for statisticians, I believe the argument is already implied. So I am confused.

I can say that I agree with the argument presented, even though I do not think the statistical-technical aspect is the bottleneck in the replicability issue. I am more worried about overlooking context and the poor understanding of the biosocial component inherent in every person involved in any study (participants, researchers, analysts, and so on). I would personally attribute most non-replicability to a poor understanding of this environmental and social context.

This paper cites Ioannidis's work heavily; he is an interesting case study and a cautionary tale of "being right until you are wrong". Ioannidis developed and used his narrative well throughout his career until he crossed paths with the COVID pandemic. There he generalized his approach, with poor outcomes, to an event that was not generalizable.

I think the paper is fine, although it could be better if it were clearer who it is aimed at. The paper is so dense in specific arguments that it is hard to review.

Author Response

Reviewer 1: Comments to the Author

    Reconsider after major revisions (substantial revisions to text or experimental methods needed). I appreciate the invitation to review this paper. I am a biostatistician who specializes in biomedical research analytics with a focus on population-level assessments of health-related topics. This paper has important links to my work and resonates with aspects of my professional experience. However, although I believe the main thesis of this article is an important argument, I am not sure it is presented in the best way.

    I interpret the main argument of this paper as being that replicability is at the mercy of sloppy, cookie-cutter pipelines that overlook the nature of the assessments. Anyone experienced in applied statistics eventually comes to understand that what matters is not knowing how to run a test but knowing when it is appropriate to run it. Following that idea, there is also the subsequent understanding that these tests will ultimately be interpreted by many who neither understand nor care about the appropriateness of these assessments.

    With these points in mind, my main concern with this paper is that I am not sure who it is aimed at. I tell my students to pay attention to how they tell the story: understanding the people who need to hear the story, so the narrative can be tailored to them, is as important as the message itself. If this paper is aimed at a broader audience, this may not be the best approach. If it is for statisticians, I believe the argument is already implied. So I am confused. I can say that I agree with the argument presented, even though I do not think the statistical-technical aspect is the bottleneck in the replicability issue. I am more worried about overlooking context and the poor understanding of the biosocial component inherent in every person involved in any study (participants, researchers, analysts, and so on). I would personally attribute most non-replicability to a poor understanding of this environmental and social context.

AUTHOR: I greatly appreciate the reviewer's comments, insights, and criticisms, and I have done my best to address them. I begin with the key issue raised by the reviewer: "I am not sure who the paper is aimed at". I have given seminars and presentations based on the manuscript under review to very different audiences, including psychologists, economists, medical doctors interested in statistics, philosophers of science, and statisticians. The reactions from each of these audiences were markedly dissimilar, and after numerous revisions to take the different reactions into account, I concluded that it is an impossible task to cater to all of these diverse audiences, with their markedly different backgrounds in probability theory and statistics and their different ideas about what statistics and evidence are all about!

    After more than a decade of working on the manuscript under review, I submitted it to a statistics journal because the common denominator for all of these diverse audiences is statistical modeling and inference, which is primarily applied mathematics. The confusions I detected in the different audiences stemmed mostly from an inadequate understanding of probability theory and statistical inference; hence the emphasis on delineating the probabilistic concepts and statistical procedures in the manuscript under review. Doing so required precise mathematical notation to define the various statistical concepts and procedures, which might seem pedantic to a mathematical statistician and utterly unnecessary to a psychologist. Statisticians are likely to find some of the methodological discussions unnecessary, but philosophers of science interested in evidence from data would consider them absolutely necessary. As a result of these conflicting perspectives, I decided to focus mainly on delineating the concepts and procedures used in statistical modeling and inference as they relate to replication and trustworthy evidence; hence the submission of the paper to a statistics journal.

   

Reviewer 1: This paper cites Ioannidis's work heavily; he is an interesting case study and a cautionary tale of "being right until you are wrong". Ioannidis developed and used his narrative well throughout his career until he crossed paths with the COVID pandemic. There he generalized his approach, with poor outcomes, to an event that was not generalizable.

AUTHOR: The reviewer is absolutely spot on about Ioannidis and his fame!

    His 2005 paper, however, in addition to the replicability problem, raised the issue of untrustworthy evidence, which has been endemic in several disciplines. Well before the COVID pandemic, I decided that I would reply to his initial paper and to the numerous papers that take his arguments at face value and elaborate on them, creating an industry in several different disciplines. A simple Google Scholar search for "replication crisis" returns more than 700,000 results, and the Ioannidis (2005) paper has more than 14,000 citations.

Reviewer 1: I think the paper is fine, although it could be better if it were clearer who it is aimed at. The paper is so dense in specific arguments that it is hard to review.

AUTHOR: I appreciate the comment, and I did my best to rewrite certain sections of the manuscript under review, such as the abstract and Sections 1, 2.1, and 3-6, with a view to improving the arguments by rendering them more focused and less dense.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a sharp critique of the widely accepted diagnosis of the replication crisis, especially the role attributed to frequentist statistical testing. It does indeed regard it as a true "replication crisis". The author offers an alternative rooted in Fisher's framework, arguing for and proposing post-data severity evaluations.

The manuscript is clear and well written but would benefit from some minor clarifications.

 

  • Critiques of Ioannidis (2005) are revisited multiple times with no clear benefits to the reader. Please be more concise.
  • Reformat Table 4. Display each entry as an equation (\phi_1=.., \phi_2=…)
  • Section 5 on post-data severity is one of the highlights of the paper. Consider introducing this concept earlier to set the stage for the alternative view. Perhaps in the Introduction when you mention your objectives.
  • Rather than just cite Poudyal & Spanos (2022), include one short replication of a classic paper showing how SEV changes the inference.
  • Do similar replicability/trustworthiness issues arise in Bayesian or machine-learning workflows? Please discuss.

 

Author Response

Reviewer 2: Comments to the Author. Accept after minor revisions (corrections to minor methodological errors and text editing). This paper presents a sharp critique of the widely accepted diagnosis of the replication crisis, especially the role attributed to frequentist statistical testing. It does indeed regard it as a true "replication crisis". The author offers an alternative rooted in Fisher's framework, arguing for and proposing post-data severity evaluations. The manuscript is clear and well written but would benefit from some minor clarifications.

AUTHOR: I greatly appreciate the reviewer’s comments, insights and suggestions.

Reviewer-2: Critiques of Ioannidis (2005) are revisited multiple times with no clear benefits to the reader. Please be more concise.

AUTHOR: I revised certain sections of the manuscript under review (the abstract and Sections 1, 2.1, and 3) to address the issue raised by the reviewer.

Reviewer-2: Reformat Table 4. Display each entry as an equation (\phi_1=.., \phi_2=. . . )

AUTHOR: Thank you; Done!

Reviewer-2: Section 5 on post-data severity is one of the highlights of the paper. Consider introducing this concept earlier to set the stage for the alternative view. Perhaps in the Introduction when you mention your objectives.

AUTHOR: Thanks for the comment. The revised manuscript introduces the post-data severity evaluation in more detail in Section 1, when summarizing the primary objectives of the paper.

Reviewer-2: Rather than just cite Poudyal & Spanos (2022), include one short replication of a classic paper showing how SEV changes the inference.

AUTHOR: I eliminated the citation to Poudyal & Spanos (2022) to avoid too many self-citations. I have, however, added in Section 3.2 an example from Do and Spanos (2024) relating to the Phillips curve, a very famous relationship in economics, demonstrating that the replication of highly cited papers reveals that most of them, if not all, are statistically misspecified. In Section 5.3, I also added an example from Spanos (2023) comparing the significance claimed by the authors for a regression coefficient β at α = 0.05, with an estimated value β̂ = 0.004 and a p-value p = 0.045 based on a sample size of n = 24,732,966. The post-data severity (SEV) evaluation of that claim indicates that the warranted magnitude of the coefficient would be considerably smaller, at β ≤ 0.00000001, with high enough severity. This is due to the trade-off between the type I and type II error probabilities.
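For readers curious how such a post-data severity evaluation behaves, the following Python sketch computes SEV(β > β₁) for a one-sided Normal test of a regression coefficient. It is a minimal illustration under stated assumptions, not the computation in Spanos (2023): the standard error is a hypothetical value backed out from the reported estimate and p-value. It shows that only discrepancies far below the point estimate are warranted with reasonably high severity.

    # Minimal sketch of a post-data severity (SEV) curve for the claim beta > beta_1,
    # assuming a one-sided Normal test of H0: beta <= 0 vs H1: beta > 0.
    # The standard error is a hypothetical value implied by the reported estimate
    # and p-value; it is NOT taken from Spanos (2023).
    from scipy.stats import norm

    beta_hat = 0.004               # reported coefficient estimate
    p_value = 0.045                # reported p-value (treated here as one-sided)
    d_obs = norm.ppf(1 - p_value)  # observed test statistic, about 1.70
    se = beta_hat / d_obs          # implied (hypothetical) standard error

    def severity(beta_1):
        """SEV(beta > beta_1) = P(d(X) <= d_obs; beta = beta_1)."""
        return norm.cdf(d_obs - beta_1 / se)

    for b1 in [0.0, 1e-8, 0.001, 0.002, 0.004]:
        print(f"SEV(beta > {b1:g}) = {severity(b1):.3f}")

    # Tiny discrepancies (e.g. beta_1 <= 1e-8) pass with severity close to 1 - p,
    # while the claim beta > beta_hat has severity only 0.5, illustrating the
    # trade-off between the type I and type II error probabilities.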

Reviewer-2: Do similar replicability/trustworthiness issues arise in Bayesian or machine learning?

AUTHOR: Yes! In general, Bayesian statisticians take the likelihood function at face value and ignore the validity of the probabilistic assumptions invoked by the distribution of the sample, as well as the ensuing likelihood function. The situation in machine learning is even worse, because statistical modeling is viewed as curve-fitting with mathematical approximation functions, ignoring the fact that you cannot have statistical inference without probabilistic assumptions imposed (implicitly or explicitly) on the data. They conflate statistical adequacy with goodness-of-fit/prediction, which is neither necessary nor sufficient for statistical adequacy.

See Spanos, A. (2022), "Statistical modeling and inference in the era of Data Science and Graphical Causal modeling", for a detailed discussion of these issues. I chose not to cite this paper in order to keep self-citations to a minimum.

Reviewer 3 Report

Comments and Suggestions for Authors

Dear author, I have no further inquiries on your manuscript. Well done!

Author Response

I greatly appreciate your positive comments!

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I very much appreciate the response from the author. The context provided helped me understand the purpose of the paper better. It has completely changed my perception of, and opinion on, its significance.

I very much like how many areas were simplified. Considering the context of a wider audience, this was a perfect decision. An example is Section 2.1, where the argument is much more concise and some arguments that were previously there for support, but were unnecessary, have been removed.

I love the example added at the end on the significant t-test with β̂ = 0.004; that example describes a situation I am well familiar with. It is ridiculous, and I love that it was added right there, in people's faces. However, I must admit that the motivation of researchers for emphasizing such findings is not naïve and is perhaps just as important to discuss. Publication bias is a thing, and negative results are extremely hard to publish, as there are no incentives to publish failure. For some researchers, clinging to a ridiculous finding like this one is the only option left.

Again, I appreciate the context, and now I see the motivation for having this published.

 
