Article
Peer-Review Record

How the Post-Data Severity Converts Testing Results into Evidence for or against Pertinent Inferential Claims

Entropy 2024, 26(1), 95; https://doi.org/10.3390/e26010095
by Aris Spanos
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 29 October 2023 / Revised: 29 November 2023 / Accepted: 29 December 2023 / Published: 22 January 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

It is a beautifully insightful paper. I would only have one suggestion for the author. The author lists three ways in which the uninformed and recipe-like implementation of statistical modeling and inference contributes to the untrustworthiness of the empirical evidence provided. I can agree with the author that all three "ways" are important. But it seems to me there is a fourth one, namely the unawareness of most empirical researchers of the sensitivity of model selection to the choice of loss function. Please see the 2021 article titled "Nonparametric tests for Optimal Predictive Ability" by Arvanitis et al., published in the International Journal of Forecasting (https://www.sciencedirect.com/science/article/pii/S0169207020301564?via%3Dihub), for a sense of this sensitivity.

Author Response

Author’s response

I greatly appreciate the above comments by reviewer 1 and thank him for reading the paper carefully. His suggestion to relate my discussion to selecting models on goodness-of-fit/prediction grounds would have increased the length of the paper and forced me to add more self-citations. I added a short paragraph that mentions the issues involved:

“The key weakness of this strategy is that excellent goodness-of-fit/prediction is neither necessary nor sufficient for the statistical adequacy of the selected model since it depends crucially on the invoked loss function whose choice is based on information other than the data. It can be shown that statistical models chosen on goodness-of-fit/prediction grounds are often statistically misspecified; see Spanos (2007).”

 

Reviewer 2 Report

Comments and Suggestions for Authors


Comments for author File: Comments.pdf

Author Response

Author’s response to Reviewer 2

Referee report:

I learnt about the concept of statistical evidence from Royall's book 'Statistical Evidence' (Royall, 1997), his papers and personal discussions I had with him on the topic. Professor Royall has clearly influenced my thinking on evidence. I read Professor Spanos' paper with substantial interest hoping to learn an alternative statistical approach to quantifying evidence in the data. I have to say that, unfortunately, I have been underwhelmed by it.

I am going to point out a few of the reasons, although there are many more.

General comments:

1. It is truly surprising to see a paper on statistical evidence that does not even mention the approaches developed by Royall, Birnbaum, and Barnard, among others. Are they irrelevant? Are they wrong? Is the author unaware of these (unlikely)?

2. It is well established by now that p-values do not quantify evidence. Neyman was also very clear that the N-P approach is about decision making and not about evidence. That some researchers continue to use them incorrectly is also well known. I would have expected a paper on statistical evidence to compare and contrast their approach with Royall's approach (which is probably the best explicated, if not the best, approach to statistical evidence).

Author’s response: I appreciate the honesty of the reviewer. The aim of the paper is not to evaluate Richard Royall’s (1997) book “Statistical Evidence: A likelihood paradigm”, or to overwhelm readers who adopted that particular perspective. If the journal were to ask me to write such a paper, I would be delighted to do it. For the reviewer’s information, I have already published three papers that bring out the key weaknesses of Royall’s likelihood approach to evidence, listed below, but I will be happy to elaborate and bring out additional weaknesses in a new paper:

1. “Revisiting the Likelihoodist Evidential Account,” Journal of Statistical Theory and Practice, 7, 187-195, 2013.
2. “Who Should Be Afraid of the Jeffreys-Lindley Paradox?” Philosophy of Science, 80, 73-93, 2013.
3. “What Foundations for Statistical Modeling and Inference?” Œconomia – History / Methodology / Philosophy, 9(4), 832-860, 2019. URL: http://journals.openedition.org/oeconomia/7521

 

Specific comments: Given a data set, there are two fundamental questions one asks:

(a) What is the strength of evidence for hypothesis A vis a vis hypothesis B?

(b) What is the uncertainty associated with this strength of evidence?

I am going to try to understand and evaluate this manuscript based on the answers to these questions. The author talks about quantifying evidence in the observed data but never rigorously defines what constitutes evidence in the data and its quantification. Without such a precise and clearly stated definition of statistical evidence, I do not know how one can evaluate whether the proposed approach is sensible or not. It is also necessary to specify how to quantify uncertainty in the strength of evidence.

Author’s response: I’m not sure whether the reviewer read the manuscript with an open mind to be persuaded or whether he/she was looking for an affirmation of his/her preferred approach to statistical evidence. The second page of the manuscript addresses both of the above questions when describing the primary focus of the paper:

The discussion revolves around the distinction between unduly data-specific ‘inference results’, such as point estimates, observed CIs, p-values, effect sizes, and the accept/reject H0 results, and ensuing inductive generalizations from such results in the form of ‘evidence for or against germane inferential claims’ framed in terms of the unknown parameters θ. The crucial difference between ‘results’ and ‘evidence’ is twofold:
(a) the evidence, in the form of warranted inferential claims, is framed in terms of post-data error probabilities aiming to account for the uncertainty arising from the fact that ‘inference results’ rely unduly on the particular data x0:=(x1, x2, ..., xn), which constitutes a single realization X=x0 of the sample X:=(X1, ..., Xn), and

(b) the evidence, in the form of warranted inferential claims, enhances learning from data x0 about the mechanism that could have generated x0.

 

The author claims Mayo and Spanos (2006) proposed the post-data severity (SEV) evaluation of the accept/reject H0 results as a way to convert these 'results' into evidence for germane inferential claims (Lines 306-307).

Let us go through the discussion in the paper following this statement.

1. Line 312: "only when all the different ways it can be false have been adequately probed and forfended" (Mayo, 1996). This is a very bold statement indeed. In real life, there are infinitely many ways a hypothesis can be false. Do you really mean one can envision ALL such ways?

Author’s response: I’m not sure whether the reviewer realizes that we are focusing on a narrow statistical context, where the statistical model M_{θ}(x) defines the premises of the inferences to be drawn. Assuming the validity of the premises, we derive statistical inference procedures for estimation, testing, prediction, etc., that revolve around the relevant sampling distributions. These derivations are based on mathematical deduction: if M_{θ}(x) is valid, then certain inference propositions (optimality, error probabilities, etc.) are valid. The optimality and reliability of any inferences drawn, however, depend crucially on the invoked premises, whose approximate validity (statistical adequacy) secures the reliability of inference and the trustworthiness of the ensuing evidence. In this narrow statistical context the slogan “In real life, there are infinitely many ways a hypothesis can be false” is totally misplaced and unhelpful. There are broader contexts where one can probe for data issues, substantive issues, causality issues, etc.

 

2. Lines 320-321: "Hence, for the discussion that follows it is assumed that Mθ(x) is statistically adequate for the particular data x0." Is there a definition and quantification of 'statistically adequate'?

Author’s response: Yes there is. First ensure that your statistical model is specified in terms of a complete, internally consistent, and testable set of probabilistic assumptions. Second, you test those assumptions thoroughly using trenchant misspecification (M-S) testing to ensure their approximate validity (statistical adequacy). Third, if any of the model assumptions are invalid, you respecify it until you get a statistically adequate model that accounts for all the systematic statistical information in the particular data. I have published three textbooks and more than 60 papers on how one establishes the statistical adequacy of numerous statistical models in the literature using comprehensive misspecification (M-S) testing. I cannot keep addressing that particular issue in every paper I submit for publication to ensure that all the reviewers are fully informed; I cite several published papers where one can evaluate the effectiveness of the proposed M-S testing procedures. Indeed, a different reviewer for this manuscript raised bitter complaints about too many self-citations!
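
To give one concrete illustration of what probing a model assumption can look like, here is a minimal sketch (Python with SciPy; it is not taken from the author's cited papers and uses simulated data): a Wald-Wolfowitz runs test probing the independence assumption of a simple Bernoulli model, one example of the kind of misspecification (M-S) test referred to above.

```python
import numpy as np
from scipy import stats

def runs_test(x):
    """Two-sided Wald-Wolfowitz runs test for a 0/1 sequence; returns (z, p-value).
    A small p-value signals departures from the independence assumption."""
    x = np.asarray(x)
    n1 = int(x.sum())                                # number of ones
    n0 = len(x) - n1                                 # number of zeros
    runs = 1 + int(np.sum(x[1:] != x[:-1]))          # observed number of runs
    mean = 1 + 2.0 * n1 * n0 / (n1 + n0)             # expected runs under independence
    var = (2.0 * n1 * n0 * (2.0 * n1 * n0 - n1 - n0)) / ((n1 + n0) ** 2 * (n1 + n0 - 1))
    z = (runs - mean) / np.sqrt(var)
    return z, 2 * stats.norm.sf(abs(z))

# Simulated IID Bernoulli data, for illustration only.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=500)
print(runs_test(x))
```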

3. Line 326: "The sampling distribution of d(X) evaluated under H0 (hypothetical) is:" This is a strange statement given that the hypothesis is a composite hypothesis. Can I choose any value in the set of parameters? This comment is particularly relevant when there are nuisance parameters. Is the author really comparing simple vs. composite hypotheses here? It would be good if such statements were made rigorous. I am sure the author knows his statement is statistically loose. The distribution cannot be the Binomial distribution in its strict sense. Why not just say the distribution is approximately Normal(0,1)?

Author’s response: The notions of simple vs. composite hypotheses and nuisance parameters are a red herring here. For composite hypotheses, the evaluation under the null needs only the largest μ, which is often μ0, in defining the type I error probability and the p-value. In deriving the type II error probability and the power, one needs to do that for all values in the alternative space. If you want to make a point criticizing the author, you should go through the derivations in the paper and point out what is wrong with them, or “statistically loose”, as the reviewer claims. The sum of Bernoulli IID random variables is Binomially distributed, but that distribution can be approximated well for n>40 by a Normal distribution.
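
As a quick numerical check of this last point, here is a minimal sketch (Python with SciPy; not taken from the paper) comparing the exact Binomial tail probability with its Normal approximation, using the illustrative counts quoted later in this exchange (5152 successes in n = 9869 trials) and the null value θ0 = .5:

```python
from scipy import stats
import numpy as np

n, k, theta0 = 9869, 5152, 0.5

# Exact one-sided p-value under the Binomial null: P(number of successes >= k; theta0)
p_exact = stats.binom.sf(k - 1, n, theta0)

# Normal approximation via the standardized test statistic
d_obs = (k / n - theta0) / np.sqrt(theta0 * (1 - theta0) / n)
p_normal = stats.norm.sf(d_obs)

print(f"exact Binomial tail: {p_exact:.2e}, Normal approximation: {p_normal:.2e}")
```

For n of this order the two tail probabilities are very close, which is the point of the author's remark.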

 

4. Lines 333-334: "The post-data severity evaluation transforms the 'accept/reject H0 results' into 'evidence' for or against relevant inferential claims framed in terms of θ." This was a promising statement. The author then describes the post-data severity test as: "A hypothesis H (H0 or H1) passes a severe test Tα with data x0 if (C1) x0 accords with H, and (C2) with very high probability, test Tα would have produced a result that 'accords less well' with H than x0 does, if H were false" (Mayo and Spanos, 2006, 2011). It is still unclear how all this quantifies strength of evidence and its uncertainty.

Author’s response: As stated in section 2.1 of the paper:

“The primary objective of model-based frequentist inference is to ‘learn from data x0’ about θ∗, where θ∗ denotes the ‘true’ θ in Θ. This is shorthand for saying that there exists a θ∗∈Θ such that Mθ∗(x) = f(x; θ∗), x∈R^n, could have generated x0.”

How good particular evidence is in the context of frequentist inference should be evaluated by how well the particular evidential account achieves this primary objective of ‘learning from data x0’ about θ∗.

The “strength of evidence” is a loose notion to which each one of us could impute a different meaning. The discrepancy from the null value warranted by the particular data and test with probability .95 is something very specific, which gives rise to learning from data.
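
To make this concrete, here is a minimal sketch (Python with SciPy, using the Normal approximation and the illustrative counts 5152/9869 quoted below) of how the post-data severity of the one-sided claim θ > θ1 can be evaluated after a rejection of H0; the exact expressions and figures in the paper should be taken as authoritative, and the standardization used there may differ slightly.

```python
from scipy import stats
import numpy as np

def severity_greater(k, n, theta1):
    """SEV(theta > theta1; x0): the probability that the test would have produced an
    outcome according less well with the claim (i.e. d(X) <= d(x0)) if theta = theta1,
    computed here under the Normal approximation to the Binomial."""
    xbar = k / n
    se1 = np.sqrt(theta1 * (1.0 - theta1) / n)   # standard error evaluated at theta1
    return stats.norm.cdf((xbar - theta1) / se1)

k, n = 5152, 9869
for theta1 in (0.50, 0.512, 0.52, k / n):
    print(f"SEV(theta > {theta1:.4f}) = {severity_greater(k, n, theta1):.3f}")
```

Evaluating the claim at θ1 equal to the observed proportion makes the standardized discrepancy zero, so the severity drops to .5; this is how the SEV = .5 figure discussed in point 9 below arises.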

 

5. Line 344: optimal test: What is an optimal test here? Under what criterion? The Neyman-Pearson criterion? This is problematic when there are nuisance parameters. Does the post-data severity procedure need an optimal testing procedure to quantify evidence?

Author’s response: Yes, the particular test is UMP in the N-P sense. There is nothing problematic about the tests used in this paper. There are no nuisance parameters in any of the tests discussed in this manuscript. The variance in the simple Normal model is not a nuisance parameter. It is an integral part of the test which, when estimated, changes the sampling distributions under both the null and the alternative from the Normal to Student’s t, to account for the additional uncertainty stemming from estimating the variance.

6. What does it mean to say 'data accords with H0'? Is this based on the one-sided p-value? Again, a major problem when there are nuisance parameters.

Author’s response: If the null is not rejected at the particular alpha, the data accord with the null, and the other way around if the null is rejected.

7. Lines 345-346: "Broadly speaking, this result indicates that the 'true' value θ* of θ lies within the interval (.5, 1), which is too coarse to engender any learning about θ*." This is clearly a strawman argument.

Author’s response: For the author there is nothing in this claim that could be described as a strawman argument. Learning from data is the primary objective and the inferential claim does contribute significantly to that objective as opposed to the accept/reject results.

Most researchers will use a two-sided confidence interval, which here is (0.5121835, 0.5318939). We have learnt something about the true value of θ using the confidence interval. I suspect the evidence interval using Royall's argument will be very similar.

Author’s response: Most researchers are wrong. Several of my cited papers include principled arguments for why observed CIs cannot be legitimately interpreted in the way the reviewer prefers.
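
For readers who want to check the figures, the reviewer's interval appears to coincide with a standard two-sided 95% Wald interval for a Bernoulli proportion computed from the counts quoted elsewhere in this exchange (5152 successes in n = 9869 trials). A minimal sketch under that assumption (it says nothing about how the interval was actually obtained, nor about the interpretive dispute):

```python
from scipy import stats
import numpy as np

k, n = 5152, 9869
phat = k / n
se = np.sqrt(phat * (1 - phat) / n)     # estimated standard error of the proportion
z = stats.norm.ppf(0.975)               # two-sided 95% Normal quantile
print(f"95% Wald CI: ({phat - z * se:.7f}, {phat + z * se:.7f})")
# close, up to rounding, to the (0.5121835, 0.5318939) quoted by the reviewer
```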

 

8. Line 362: "however, can be informally interpreted as evidence". So there is no formal definition of evidence in this approach?

Author’s response: The formal definition of evidence comes in the form of the relevant inferential claim based on the discrepancy from the null value warranted by the particular data and test with high enough probability.

9. Lines 365-367: "It is also important to emphasize that the SEV evaluation of the inferential claim θ>θ1=θ0+γ1, γ1≥0 with discrepancy γ1=.0223, based on xn=5152/9869=.5223, gives rise to SEV(T>α; x0; θ>.5223)=.5, which is no evidence for θ1≤.5223." Does this mean that if severity is smaller than or equal to 0.5, it implies no evidence?

Author’s response: Yes it does!

10. Lines 368-370: "Hence, the importance of distinguishing between 'statistical results', such as xn=.5223, and 'evidence' for or against inferential claims relating to xn." Is this similar to comparing the evidence for the MLE? Evidence is for two pre-specified values of the parameters or models, not for the MLE.

Author’s response: No.

 

11. Lines 371-375: "What is the nature of evidence the post-data severity (SEV) gives rise to? Since the objective of inference is to learn from data about phenomena of interest via learning about θ*, the evidence from the SEV comes in the form of an inferential claim that revolves around the discrepancy γ1 warranted by the particular data and test with high enough probability, pinpointing the neighborhood of θ* as closely as possible." Is this the quantification of evidence in this approach? An interval? The author should expand on this. How does this compare with the likelihood-based evidence interval that Royall suggests?

Author’s response: I will be happy to write another paper to answer the questions of the reviewer.

12. Line 379: "entailing θ*∈(.5, 1) down to θ*∈(.512, .5156)". Is this the interval that a scientist can/should report? How should one use this interval in practice? If this is a statistical inferential statement, what is the uncertainty associated with this statement? Notice also that the lower limit of this interval is the same as that of the CI, and this interval is nearly equal to half of the CI. Is this by happenstance or is it true in general (for large n)? Notice also that the point estimate is not contained in the interval (0.512, 0.5156). How does one interpret this interval?

Author’s response: As explained above, the formal definition of evidence comes in the form of the relevant inferential claim based on the discrepancy from the null value warranted by the particular data and test with high enough probability. The shortening of the coarse accept/reject result is an intuitive way to link the formal definition to the primary objective of frequentist inference.

13. Line 394: "Is this 'accept H0' result at odds with the previous 'reject H0' result?" Of course not. If one changes the hypothesis, one can get acceptance for the new hypothesis while rejecting the old one.

Author’s response: There is nothing contradictory about the two testing results since they give rise to identical evidence.

14. Line 394: "a feature of a sound account of statistical evidence". What are the features of any sound account of statistical evidence? Are there desiderata? I did not see any in the manuscript. Why does it imply 'robustness'? What is it robust against?

Author’s response: It is robust against different null values.

15. Line 397: "more robust way to evaluate the replicability of empirical evidence". More robust than what? Which method are you comparing with? Usually 'more' is followed by 'than'.

Author’s response: As argued in the manuscript, inference results are unduly data-specific and cannot be interpreted as evidence for or against particular inferential claims. Hence, for the replicability of evidence one needs a notion of evidence that replicates when the practitioner follows the right steps in ensuring the proper implementation of statistical procedures as well as the statistical adequacy of the invoked statistical model.

16. What is the meaning of 'replicability'? Are you suggesting that if the severity curves are identical for two different data sets, then the results are replicated successfully? Of course, they are never identical. Then the question arises: how far is too far? How close is close enough? Back to square one?

Author’s response: As argued in the manuscript, a statistical study is said to be replicable if its empirical results can be independently confirmed, with very similar or consistent results, by other researchers using akin data and modeling the same phenomenon of interest. It does not say identical; it says similar or consistent results!

17. One of the major issues with uncertainty quantification in the frequentist approach (I include Royall's approach in this class along with Neyman, Pearson, Fisher, Wald, etc.) is the conditioning on the appropriate ancillary statistics. It is quite clear from many examples (see e.g. Lele 2020 and Casella and Goutis 1995), regression analysis being one of them, that conditioning is an essential component of quantifying the uncertainty in observed evidence. More informative data should lead to stronger (less uncertain?) evidence and vice versa. We do not know which null distribution to compute: conditional on the observed covariate values or unconditional on the covariate values? This is why we have two different types of bootstraps in the regression setup (Wu, 1986). I would like to see how the design variables (the observed covariate values) come into the evidential use of the severity tests.

Author’s response: The presence of ancillary statistics is not a problem for the post-data severity evaluation!

18. Lines 297-298: "the unwarranted inferential claim in (16), i.e. an optimal estimator θ̂(X) of θ justifies the inferential claim θ̂(x0) ≃ θ* for n large enough." Why is this not a valid conclusion? Is this not what a Law of Large Numbers and Law of Large Deviations tell us? I am confused by this statement. It says that the distance between them is very probably small. This is not that different from the kind of statements severity testing makes. These are all probabilistic statements.

Author’s response: Please read Spanos (2021a-b).

Let me summarize the kind of revisions I would like to see in this paper. This is not an exhaustive list. After such revisions are made, I may have further comments (once I see a clear exposition of the evidential interpretation of the severity curve).

1. The author needs to compare his evidential framework with other evidential frameworks instead of with N-P testing, p-values, etc.
2. The author needs to define his concept of evidence, strength of evidence, and uncertainty quantification of the evidential statement. He should provide desiderata on what he considers the essential elements of his concept of evidence, optimality (if it exists), and show how it relates to the post-data severity testing.
3. The author needs to address the conditional inference issue in frequentist inference as it pertains to his computation of the severity curves.
4. The author needs to address the issue of handling nuisance parameters. The existence of nuisance or uninteresting or auxiliary parameters is a fact of life. How does this approach deal with nuisance parameters in a principled fashion? If the form of d(x0) is to be chosen as some test statistic used in classical frequentist theory, it is well known that there is no single answer when there are nuisance parameters. There are UMPU tests; there are Union-Intersection tests; there are C(α) tests and so on. Is there a principle that guides the choice in the presence of nuisance parameters?

Author’s response: The above suggestions seem to describe the paper the reviewer would have wanted me to write! He should write it himself/herself, and I will be happy to review it for this journal.

References:

Casella, G. and Goutis, C. (1995). Frequentist post-data inference. International Statistical Review, 63, 325-344.
Lele, S. R. (2020). How should we quantify uncertainty in statistical inference? Frontiers in Ecology and Evolution, 8, 35.
Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of Statistics, 14(4), 1261-1295.

Reviewer 3 Report

Comments and Suggestions for Authors

See the enclosed pdf

Comments for author File: Comments.pdf

Author Response

I greatly appreciate the brief comments.

Reviewer 4 Report

Comments and Suggestions for Authors

Dear editors,

this paper has 38 references, out of which 15 are by the author. Reading the paper, one has the impression its only purpose is to produce self-citations.

Next, I do not agree to the quoted statement “…most published research findings are false”, which serves as a starting point of the paper.

Next, the paper does not make a new point. Everything is built on Neyman-Pearson and its interpretation, which has been completely clear from its problem setting since the 1930s.

Some technical remarks:

* The point on the efficient market hypothesis is somewhat out of context, as all the other examples are from medicine;

* formula (1) is wrong. Probably, it should read M_\Theta, but \theta cannot occur in- and outside the set. Further, the notation is used differently at some other places in the paper, for example in line 124.

* X is not bold in (1), but it is in the following formula;

* \mathbb N={1,2,…}, and not \mathbb N= (1, 2, …).

* (10) is unclear.

* line 332f: here, x_k occurs; the other formulas only involve X_k.

* The paper mentions figures 1, 4, etc., but I have not seen any figure.

* Funny enough, the paper addresses/ criticizes p-hacking, but addresses the case of n-dependent p-values itself.

Finally, the paper criticizes “uninformed and recipe-like, implementation” of statistical methods, but does not provide a better/ other proposal.

 

Author Response

Reviewer 4

Dear editors,

this paper has 38 references, out of which 15 are by the author. Reading the paper, one has the impression its only purpose is to produce self-citations.

 

Author’s response

This is an unfair but understandable criticism in broad areas of research in statistics; it is misplaced, however, for an area largely ignored by statisticians and practitioners in most applied fields of statistical modeling and inference.

 

Next, I do not agree to the quoted statement “…most published research findings are false”, which serves as a starting point of the paper.

 

Author’s response

I’m summarizing the current state of the replication crisis literature in a few sentences and referring to the impressions created by that literature. I do not agree with Ioannidis’ diagnosis that the primary contributor to the untrustworthiness comes from abuses of frequentist testing he alludes to, and I give references that explain my view more extensively.

 

Next, the paper does not make a new point. Everything is built on Neyman-Pearson and its interpretation, which has been completely clear from its problem setting since the 1930s.

 

Author’s response

I beg to disagree with the reviewer, and I make my case in several cited papers, in addition to the current manuscript.

 

 

Some technical remarks:

* The point on the efficient market hypothesis is somewhat out of context, as all the other examples are from medicine;

Author’s response

The empirical examples are chosen to illustrate an argument or a thesis, and they do not belong to one or another discipline.

 

* formula (1) is wrong. Probably, it should read M_\Theta, but \theta cannot occur in- and outside the set. Further, the notation is used differently at some other places in the paper, for example in line 124.

Author’s response

Formula (1) is perfectly fine mathematically as is, but if it pleases the reviewer to have everything in curly brackets, I’m happy to oblige.

 

* X is not bold in (1), but it is in the following formula;

Author’s response

x is bold in (1)

 

* \mathbb N={1,2,…}, and not \mathbb N= (1, 2, …).

Author’s response

The nature of the brackets used does not make any difference to the concept defined since N is a sequence and not a set, but again if it pleases the reviewer to have everything in curly brackets, I’m happy to oblige.

 

* (10) is unclear.

Author’s response

Since an N-P test is not just a formula (a test statistic), to define it properly one needs to bring into its definition the rejection region to indicate how one uses the test with a prespecified alpha.

To avoid any possible confusion, I combined everything into one set.
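
For illustration, in notation close to that used elsewhere in this record, the combined definition amounts to specifying the test as a pair, Tα := {d(X), C1(α)}, where d(X) is the test statistic and C1(α) = {x: d(x) > cα} is the rejection region associated with the prespecified α. This is only a sketch of the idea, not a quotation of formula (10) itself.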

 

* line 332f: here, x_k occurs; the other formulas only involve X_k.

Author’s response

Following established notation in the literature, random variables are indicated by capital letters, and their values are denoted by the same small letter.

 

 

* The paper mentions figures 1, 4, etc., but I have not seen any figure.

Author’s response

I apologize to the reviewer for including the graphs in a separate file. All figures are integrated into the revised manuscript.

 

* Funny enough, the paper addresses/ criticizes p-hacking, but addresses the case of n-dependent p-values itself.

Author’s response

I’m not sure what the reviewer means, but the paper argues that “p-hacking and all that” amount to a small component of the untrustworthiness of evidence problem. Indeed, such abuses of testing are included in [b] in the introduction. The discussion also criticizes the p-value for not providing clear evidence for any inferential claim framed in terms of the unknown parameters.

 

Finally, the paper criticizes “uninformed and recipe-like, implementation” of statistical methods, but does not provide a better/ other proposal.

Author’s response

I’m not sure what the reviewer means, but the discussion in the paper revolves around concrete proposals to address the “uninformed and recipe-like, implementation” of statistical methods, under [a]-[c] in the introduction. One, however, cannot do justice to all three components in a single paper. Hence, the self-citations provide a more complete picture for those readers who want to know more about [a] and [c].

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

It is obvious that the author is unwilling to look beyond his myopic ideas about statistics. He is not only unfamiliar with the vast literature in statistics about nuisance parameters but is also unwilling to learn. I see no point in wasting more time on this manuscript than I have already spent. The author is willing to spend substantial space on comparing his evidential approach with non-evidential approaches such as N-P testing and p-values; however, when asked to compare with an alternative evidential approach, he is clearly balking. Maybe it says something! (Strong evidence against severity testing, maybe.)

Reviewer 4 Report

Comments and Suggestions for Authors

Nothing new here. Almost all equations and results in this paper are about 100 years old.
