Adaptations on the Use of p-Values for Statistical Inference: An Interpretation of Messages from Recent Public Discussions

Abstract: p-values have played a central role in the advancement of research in virtually all scientific fields; however, there has been significant controversy over their use. "The ASA president's task force statement on statistical significance and replicability" has provided a solid basis for resolving the quarrel, but although the significance part is clearly dealt with, the replicability part raises further discussion. Given the clear statement regarding significance, in this article, we treat the validity of p-value use for statistical inference as de facto established. We briefly review the literature on the relevant controversy of recent years and illustrate how already proposed approaches, or slight adaptations thereof, can be readily implemented to address both significance and reproducibility, adding credibility to empirical study findings. The definitions used for the notions of replicability and reproducibility are also clearly stated. We argue that any p-value should be reported along with its corresponding s-value, followed by (1 − α)% confidence intervals and the rejection replication index.


Introduction
P-value reporting has been an issue of controversy for the different schools of thought in statistics [1]. A consensus exists, though, regarding their misinterpretation, which has been an issue of concern within the whole statistics community [2]. The misuse of p-values, such as characterizing as a "trend" or as "near significant" any result greater than but close to the nominal level (typically 0.05), has been extensively documented in the literature (see [3]). There is ample evidence that applied researchers misuse and misinterpret p-values in practice, and even expert statisticians are sometimes prone to misusing and misinterpreting them. The large majority are generally aware that statistical significance at the 0.05 level is a mere convention, but this convention strongly affects the interpretation of evidence [4]. General guidance regarding misinterpretations has appeared in the literature [5].
Reflection articles or short communications have appeared, trying to elaborate or constructively comment, without aphorisms, on the proper use of p-values [6–13]. Harsher criticism via reflection articles, essays, and whole books has also widely appeared [14–19], while the complete rejection of their use has been proposed in essays published in highly prestigious journals [20,21]. Some authors even find it hard to decide after all [22]. Solutions proposed by prominent researchers, such as using confidence intervals instead of hypothesis tests for parameters of interest [23,24], have been criticized by others [25,26]. Adapting the alpha level in an ad hoc fashion has also been proposed [27]. The Board of Directors of the American Statistical Association (ASA) issued a Statement on Statistical Significance and p-values [28]. The ASA Statement is aimed at "researchers, practitioners, and science writers who are not primarily statisticians" and consists of six principles:

1. p-values can indicate how incompatible the data are with a specified statistical model.

2. p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
The ASA Statement notes the following: "Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail" [29].
Statistical significance often does not equate to clinical significance. To take an example from the literature, if a large trial estimates a risk ratio of 0.97 with a 95% confidence interval of 0.95 to 0.99, then the treatment effect is potentially small, even though the p-value is much lower than 0.05. Conversely, the absence of evidence does not mean evidence of absence; if a small trial estimates a risk ratio of 0.70 with a 95% confidence interval of 0.40 to 1.10, then the magnitude of the effect is still potentially large, even though the p-value is greater than 0.05. As a result, statements such as "significant finding" must be less definitive overall. A Bayesian approach may be helpful for expressing probabilistic statements (e.g., there is a probability of 0.85 that the risk ratio is <0.9) [30]. Indeed, nicely described Bayesian formulations and analogues have appeared [31,32]. A number of researchers have proposed moving to Bayesian principles (e.g., [33]). The theoretical framework of these principles has been studied in the literature [34,35]. The advantages of using Bayesian alternatives have been discussed within a rather limited generalizability framework, though [36], while standard Bayesian data-analytic measures have been shown to have the same fundamental limitations as p-values [37]. It has been documented that subjective Bayesian approaches offer some hope [38] but still exhibit severe limitations [15].
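As a minimal sketch of how such a probabilistic statement can be computed, assume a normal approximation for the log risk ratio together with a noninformative prior, so that the posterior of log(RR) is approximately normal around the estimate; the point estimate and confidence limits below are illustrative and are not taken from any cited trial.

```python
import numpy as np
from scipy import stats

# Illustrative trial summary (hypothetical numbers): RR = 0.80, 95% CI 0.60-1.07.
rr_hat, ci_low, ci_high = 0.80, 0.60, 1.07

# With a noninformative prior, the posterior of log(RR) is ~ N(log(rr_hat), se^2),
# where se is recovered from the width of the 95% CI on the log scale.
se = (np.log(ci_high) - np.log(ci_low)) / (2 * 1.96)

# Posterior probability that the risk ratio is below 0.9.
prob = stats.norm.cdf((np.log(0.9) - np.log(rr_hat)) / se)
print(f"P(RR < 0.9 | data) = {prob:.2f}")
```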
Criticism from the Bayesian side has been formally described as an overreaction [39,40]. In fact, it has recently been shown [41] that, under noninformative prior distributions, there is equivalence between the p-value and the Bayesian posterior probability of the null hypothesis for one-sided tests and, more importantly, between the p-value and a transformation of the posterior probabilities of the hypotheses for two-sided tests. Contrary to common belief, such an equivalence relationship gives the p-value an explicit interpretation as a measure of how strongly the data support the null. Contrary to broad criticisms of the use of p-values in evidence-based studies, their utility is thus justified, and their importance from the Bayesian perspective is reclaimed, establishing a common ground for both frequentist and Bayesian views of statistical hypothesis testing [41].
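The one-sided case of this equivalence can be verified directly: for a normal mean with known variance and a flat prior, the posterior probability of H0: μ ≤ 0 equals the one-sided p-value. A minimal sketch, with synthetic data and all settings illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0.4, 1.0, 30)          # synthetic data; known sigma = 1
z = x.mean() * np.sqrt(len(x))        # z statistic for H0: mu <= 0

p_value = 1 - stats.norm.cdf(z)       # one-sided frequentist p-value
# A flat prior on mu implies posterior mu | data ~ N(xbar, 1/n), so that
posterior_h0 = stats.norm.cdf(-z)     # P(mu <= 0 | data)
print(f"p-value = {p_value:.4f}, posterior P(H0) = {posterior_h0:.4f}")  # equal
```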
Good decision making depends on the magnitude of effects, the plausibility of scientific explanations of the mechanisms involved, and the reproducibility of the findings by others [42]. Even in well-designed, carefully executed studies, inherent uncertainty remains, and the statistical analysis should account properly for this uncertainty [28]. Overinterpretation of very significant but highly variable p-values is an important factor contributing to the unexpectedly high incidence of non-replication [43].
In part, limitations of p-values stem from the fact that they are all too often used to summarize a complex situation with false simplicity [44]. Overall, however, there is a broad consensus that p-values are useful tools if properly used [45,46]. One needs to keep in mind the general principles of the role of statistical significance tests in the analysis of empirical data. Decision making in contexts such as medical screening and industrial inspection must be accompanied by an assessment of the security of the conclusions [1].
A concise critical evaluation of the p-value controversy [47] offered the opportunity to expand on the subject in a special issue of the Biometrical Journal, Volume 59, Issue 5 (2017), with contributions from several renowned researchers [48]. Several articles from therein are cited in this work. Other prestigious journals have also dedicated special issues to this discussion, including complete sets of guidelines on the appropriate use of p-values [49,50]. A wealth of proposals can be found in these. A large part of the overall criticism of the use of p-values appears in the relevant special issue of the American Statistician, Volume 73, Issue sup1 (2019). Overall, topics tackled therein involve the evolution of statistical significance and decision making in hypothesis testing, the interpretation and use of p-values, the supplementing and replacing of p-values, and other holistic approaches. The more than 40 articles therein form a useful collection; however, one may claim that a Tower of Babel of scientific research has thus been constructed. The lack of consensus and of further concrete action appears to have hampered progress. Furthermore, standard textbooks still employ and support the use of the traditional p-value approach (e.g., [51]), though references to possible extensions and adaptations can be found (e.g., [52], pp. 8-11).
Overall, from our experience, we gather that, apart from the reassuring ASA statement regarding p-values, the reasons for the stagnation in terms of better reporting of statistical inference results include the fact that a large proportion of researchers in the subject-matter literature might not even be aware of the relevant controversy and its importance. This might be the result of paywalls at the journals that have taken part in this dialog so far. Most importantly, possible changes also need to be implemented in widely used statistical software and introductory statistics textbooks alike; otherwise, not much can be expected to change.
The p-value seems to be here to stay. We may complement it, however, to reach firmer conclusions regarding statistical inference and to provide a reference point regarding the replicability of statistical analysis results. Subtle differences exist between the notions of reproducibility and replicability as defined in the literature. Specifically, reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as those used by the original investigator. This requires, at a minimum, the sharing of data sets, relevant metadata, analytical code, and related software. Replicability refers to the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected [53,54]. We adopt this definition in this work, although the notions of replicability and reproducibility are often used interchangeably in published research. Specifically, to quantitatively account for reproducibility, in this work, we revisit the approach proposed in Boos and Stefanski (2011) [55], illustrate its use, and argue that it provides a path that allows statistical inference to move forward, now that the required computational power, rather minimal by today's standards, is readily available.
In most cases, by simply using a p-value, we ignore the scientific tenet of both replicability and reproducibility, which are important characteristics of the practical relevance of test outcomes [56]. Poor replicability from studies demonstrating significant p-values at the 5% level has been documented [57,58]. Several researchers have tried to build on the concept [59], while the need for an accompanying replicability index has also been documented [60]. The use of B-values in a two-stage testing approach has been proposed as a procedure that can improve reproducibility [61]. The adaptations studied in this article make use of bootstrapped data. Since no actual new data are sampled, we adopt "reproducibility" as the appropriate term for the described methods.
The accurate interpretation of the p-value is probably the most important piece of advice in the relevant literature reviewed above. A number of authors have, for good reason, proposed complementing it with information pertinent to the replication of study findings. We argue here that complementing all p-values in any statistical analysis report in the direction of reproducibility may actively help in decision making. In fact, the use of three quantities, (i) the p-value, (ii) the s-value along with corresponding CIs, and (iii) the rejection replication index (rri), as the output of any statistical testing procedure may be a prudent way to move forward, complementing current practice.
The proposed approaches and their implementation are reviewed in the Materials and Methods section that follows. The methods are illustrated in the Application section. We end with a discussion.

Materials and Methods
Statistical inference relies on p-value reporting, thus focusing on the validity of the null hypothesis. Supplementing the p-value with its Shannon information transform (s-value or surprisal), s = −log2(p), offers some advantages. Specifically, it measures, as an effect size, the amount of information supplied by the test against the tested hypothesis (or model). The s-value is a useful measure of the evidence against the null hypothesis: the larger the s-value, the more evidence against H0. Rounded off, the s-value shows the number of heads in a row one would need to see when tossing a coin to obtain the same amount of information against the tosses being "fair" (independent, with a "heads" probability of 1/2) rather than loaded for heads. For example, p = 0.03125 represents −log2(0.03125) = 5 bits of information against the hypothesis (such as getting 5 heads in a trial of "fairness" with 5 coin tosses), whereas p = 0.25 represents only −log2(0.25) = 2 bits of information against the hypothesis (such as getting 2 heads in a trial of "fairness" with only 2 coin tosses). Notice that −log2(0.5) = 1, corresponding to a single bit of information against the null [14].
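The transform itself is a one-liner; the following minimal Python sketch simply reproduces the coin-tossing examples above.

```python
import math

def s_value(p: float) -> float:
    """Shannon information (surprisal) transform of a p-value, in bits."""
    return -math.log2(p)

# Examples from the text: p = 0.03125 gives 5 bits of information against
# the null, p = 0.25 gives 2 bits, and p = 0.5 gives a single bit.
for p in (0.03125, 0.25, 0.5):
    print(f"p = {p}: s = {s_value(p):.0f} bits")
```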
As a second step, it is suggested that a bootstrap distribution of p-values be produced (the process underlying the bootstrap prediction intervals in [55]), offering a reference for the variability of the p-value (and, consequently, of the s-value) and an estimate of the probability of the independent replication of a statistically significant result. This is derived simply by repeatedly bootstrapping the dataset at hand and calculating the corresponding p-value. Transforming the bootstrapped p-values with the Shannon information transform yields a distribution for the corresponding s-values. The s-value follows an asymptotically normal distribution, as formally shown in Boos and Stefanski (2011) [55]. The distribution of the s-value, being more symmetric than its p-value counterpart, offers a better reference for effect sizes against the null hypothesis. Confidence intervals for the s-value can then be readily constructed using any bootstrap variant of choice. The rejection replication index (rri) is simply defined as the proportion of bootstrap replicates in which H0 is rejected at a given significance level.
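A minimal sketch of the full procedure follows, under stated assumptions: a two-sample Welch t-test stands in for whatever test is of interest, B = 2000 resamples are used, the s-value interval is of the simple percentile variant, and the data are synthetic. Any test returning a p-value, and any bootstrap CI variant, can be substituted.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def bootstrap_p_s_rri(x, y, B=2000, alpha=0.05):
    """Bootstrap the p-value of a two-sample test, transform to s-values,
    form a percentile CI for the s-value, and compute the rri as the
    proportion of bootstrap replicates with p < alpha."""
    p_boot = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=len(x), replace=True)   # resample each group
        yb = rng.choice(y, size=len(y), replace=True)
        p_boot[b] = stats.ttest_ind(xb, yb, equal_var=False).pvalue
    s_boot = -np.log2(p_boot)                            # Shannon transform
    s_ci = np.percentile(s_boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    rri = np.mean(p_boot < alpha)
    return s_ci, rri

# Synthetic example: two shifted normal samples.
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(0.8, 1.0, 40)
p_obs = stats.ttest_ind(x, y, equal_var=False).pvalue
s_ci, rri = bootstrap_p_s_rri(x, y)
print(f"p = {p_obs:.4f}, s = {-np.log2(p_obs):.1f} bits, "
      f"95% CI for s: [{s_ci[0]:.1f}, {s_ci[1]:.1f}], rri = {rri:.2f}")
```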
Given the computational burden involved, the proposed practice might have been difficult to follow in the early days of applied statistics; however, it is not an issue of even remote concern nowadays.

Revisiting a Real-World Published Application
We illustrate the described adjustment of p-value use by analyzing publicly available data (http://dx.doi.org/10.13140/RG.2.2.22805.96482 (accessed on 10 March 2023)) from a study evaluating the utility of the biomarkers sTREM-1 and IL-6 (the latter is an established marker) for the detection of late-onset sepsis (LoS) in neonates [62]. We revisit the analysis therein, supplementing p-values with the corresponding s-values and their confidence intervals, and with the rri.
In general, neonates that develop LoS have elevated values of the two biomarkers (Figure 1). ROC curve analysis has been used to assess the accuracy of the two biomarkers and to formally compare them using the AUC [63]; the results are summarized in Table 1. The analogy of the s-value size relative to coin tossing described in the previous section gives the big picture. The relevant bootstrap s-value distributions are illustrated in Figure 2. Specifically, we estimate that for sTREM-1, there are seven bits of information against the null hypothesis, while for IL-6, there are 21. The s-value equals the number of "heads" in a row that one would need to see when tossing a coin to obtain the same amount of information against the tosses being "fair" (i.e., "heads" with probability equal to 50%).
We conclude that IL-6 consistently discriminates between noninfected (controls) and infected (cases) neonates (illustrated in Figure 2b). The estimated rri can be regarded as evidence that the significance of IL-6 as a diagnostic marker for LoS can always be replicated, while this holds most of the time (about 80%) for sTREM-1. The latter can thus be considered a useful biomarker (illustrated in Figure 2a). The formal comparison of the AUCs shows that IL-6 is, in general, a better biomarker, but this evidence is not overwhelming given that rri = 0.441, suggesting that a significant difference between IL-6 and sTREM-1 can be expected to reproduce in about 44% of replications of the experiment. The corresponding bootstrap distribution of the s-values is illustrated in Figure 2c. A more complete assessment is thus provided than a mere reference to a "marginally significant" result of p = 0.0534.
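As a hedged sketch of how a single-marker analysis of this kind can be approximated, note that the Mann–Whitney U statistic is directly related to the empirical AUC, so bootstrapping its p-value yields an rri for the hypothesis AUC = 0.5. The data below are synthetic placeholders, not the study data, and the Mann–Whitney test is an illustrative stand-in for the ROC methodology of [63].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def marker_rri(controls, cases, B=2000, alpha=0.05):
    """rri for a single biomarker: the proportion of bootstrap replicates
    in which the Mann-Whitney test rejects H0: AUC = 0.5 at level alpha."""
    p_boot = np.empty(B)
    for b in range(B):
        cb = rng.choice(controls, size=len(controls), replace=True)
        kb = rng.choice(cases, size=len(cases), replace=True)
        p_boot[b] = stats.mannwhitneyu(cb, kb, alternative="two-sided").pvalue
    return np.mean(p_boot < alpha)

# Synthetic placeholder data: cases have elevated marker values on average.
controls = rng.lognormal(mean=0.0, sigma=0.5, size=50)
cases = rng.lognormal(mean=0.6, sigma=0.5, size=30)

u = stats.mannwhitneyu(controls, cases, alternative="two-sided")
auc = 1 - u.statistic / (len(controls) * len(cases))  # P(case > control)
print(f"AUC estimate: {auc:.2f}, rri = {marker_rri(controls, cases):.2f}")
```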

Discussion
Statistics quantify uncertainty. Rather than demanding impossible guarantees, we should be skeptical of those who peddle them and should celebrate those who tell us about risk and imprecision [66]. One could claim that most statisticians would agree that the answer to both questions below is a clear 'No' [67]:

1. Is hypothesis testing (or some other approach to binary decision making) unsuitable as a methodological backbone of the empirical, inductive sciences?

2. Should p-values (or Bayesian analogs of them) be banned as a basic tool of statistical inference?
Citing Wellek (2017) [47], the logic behind statistical hypothesis testing is not free of elements deemed artificial and, in line with this, is difficult for many applied researchers to grasp. Perhaps the most conspicuous of these features is the intrinsic asymmetry of the roles played by the two hypotheses making up a testing problem: only H0 can be rejected subject to a known bound on the risk of taking a wrong decision, and hence only its logical counterpart (typically its complement) can be confirmed. In fact, there is no need to even state an alternative hypothesis in the framework of classical hypothesis testing (see [68], p. 9 and [39]), and important textbooks avoid even simple reference to alternative hypotheses [68,69]. Rather than having formal rules for when to reject the null model, it has been suggested that one can just report the evidence against the null model [69]. However, the notion of an alternative hypothesis is fundamental to study design and sample size calculation as a first step in an experimental study and can be useful in distinguishing the null from the alternative hypothesis distribution given the data. Specifically, the distribution function of a test statistic under the alternative hypothesis is equivalent to the ROC curve of the distribution of the corresponding p-value under the null hypothesis vs. its distribution under the alternative [70,71]. Theoretical and specific technical properties have been detailed [55,72–74]. This equivalence has also been noted in ROC-related research [75–77]. As a result, the use of ROC-based criteria for general statistical inference can also offer a wide area of future research on the topic.
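This equivalence is easy to check by simulation. In the sketch below, for an assumed one-sided z-test with known variance (all settings illustrative), the ROC curve that plots the distribution of the p-value under the alternative against its uniform distribution under the null traces exactly the power function of the test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu1, N = 25, 0.5, 100_000   # sample size, effect size, simulations

# One-sided one-sample z-test: p-values under H0 (mu = 0) and H1 (mu = mu1).
z0 = rng.normal(0.0, 1.0, (N, n)).mean(axis=1) * np.sqrt(n)
z1 = rng.normal(mu1, 1.0, (N, n)).mean(axis=1) * np.sqrt(n)
p0 = 1 - stats.norm.cdf(z0)    # Uniform(0, 1) under H0
p1 = 1 - stats.norm.cdf(z1)    # stochastically smaller under H1

# ROC point at threshold t: x = P(p0 <= t) = t, y = P(p1 <= t) = power at t.
for t in (0.01, 0.05, 0.10):
    power = 1 - stats.norm.cdf(stats.norm.ppf(1 - t) - mu1 * np.sqrt(n))
    print(f"t = {t:.2f}: FPR = {np.mean(p0 <= t):.3f}, "
          f"TPR = {np.mean(p1 <= t):.3f}, analytic power = {power:.3f}")
```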
Other concepts attempting to replace or complement the traditional p-value have appeared in the literature [12,49,70,71,78–91]. The wealth of possible options has probably left the community puzzled. Although the problem has been rather "easy to spot", conflicting solutions have appeared. Among the proposed solutions, one could specifically mention second-generation p-values [92], which have also been supported with implementation options [93]; the calibration of p-values, which has, however, been developed in a rather limited framework [94]; and the fragility index, i.e., the minimum number of patients whose status would have to change from a nonevent to an event to turn a statistically significant result nonsignificant, with smaller numbers indicating a more fragile result [95] (a small computational sketch follows this paragraph). The simultaneous testing of superiority, equivalence, and inferiority has also been studied and proposed [96,97]. Another idea that has been developed in a limited framework is the replication probability approach [98,99].
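For concreteness, here is a minimal sketch of a fragility-index computation in the spirit of [95], assuming a 2 × 2 trial table analyzed with Fisher's exact test; the event counts are hypothetical.

```python
from scipy import stats

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Smallest number of nonevent-to-event status flips, applied in the
    group with fewer events, that turns a significant 2x2 Fisher's exact
    test result nonsignificant."""
    p = stats.fisher_exact([[e1, n1 - e1], [e2, n2 - e2]]).pvalue
    if p >= alpha:
        return 0  # not significant to begin with
    flip_first = e1 < e2   # flip patients in the group with fewer events
    flips = 0
    while p < alpha:
        if flip_first:
            e1 += 1
        else:
            e2 += 1
        flips += 1
        p = stats.fisher_exact([[e1, n1 - e1], [e2, n2 - e2]]).pvalue
    return flips

# Hypothetical trial: 1/100 vs. 10/100 events is significant (p < 0.05),
# but flipping just a couple of nonevents to events erases the significance.
print(fragility_index(1, 100, 10, 100))
```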
Leaving the controversy aside, the p-value is gaining even further use in methodological development and decision-making approaches [100–107], as it has already done in the past [108,109]. However, does this suffice for us to settle with "all is well"? Obviously, the lack of consensus does not help much with moving forward [110,111]. Much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty and replicability [28]. In this regard, the NEJM adapted its guidelines. The journal's revised policies on p-values rest on three premises: it is important to adhere to a prespecified analysis plan if one exists; the use of statistical thresholds for claiming an effect or association should be limited to analyses for which the analysis plan outlines a method for controlling type I errors; and evidence about the benefits and harms of a treatment or exposure should include both point estimates and their margins of error [112]. All in all, there is nothing wrong with p-values as long as they are used as intended [39,113]. Thoughts on the difficulties envisioned in a possible effort to abandon the use of p-values have been concisely described [114].
In this work, we illustrated the implementation both of the s-value approach, which can help provide more complete inference in terms of significance, and of the rri, which can help address the reproducibility issue in empirical study findings. The rri can also be considered a frequentist analogue of the predictive probability of success as given in [115]. The use of a jackknife approach for the calculation of confidence intervals for the p-value itself has been proposed [116], but the corresponding distribution is typically skewed [55], giving an advantage to the accompanying s-value interpretation. Implementations for s-values have also appeared elsewhere [117,118]. The addition of confidence intervals to the corresponding s-value will also most probably affect the notorious practice of p-value hacking [119,120], making published results more reliable. The proposed inferential procedure can be readily used when a study involves multiplicity adjustments, statistical model building, etc., one limitation being the extra burden placed on researchers to accurately calculate the indices and CIs and to further interpret the results.
The proposed bootstrap approach may present limitations pertinent to the dependence on the observed data, the sample size, and sensitivity to model assumptions. Specifically, we assume that the observed data are a representative sample of the population, that a sufficiently large sample size is used to generate stable bootstrap estimates, and that the observed data provide unbiased bootstrap estimates. These are inherent limitations of the bootstrap per se, and the study of alternative approaches is a topic of further research.
Thresholds are helpful when actions are required. The discussion of p-values is so extensive because it has been expanded to cover the very broad and challenging issues of quantifying the strength of the evidence from a study and of deciding whether the evidence is adequate for a decision. Comparing p-values to a significance level can be useful, though p-values themselves provide valuable information. p-values and statistical significance should be understood as assessments of the observations or effects relative to sampling variation, and not necessarily as measures of practical significance. If thresholds are deemed necessary as a part of decision making, they should be explicitly defined based on study goals, considering the consequences of incorrect decisions. Conventions vary by discipline and purpose of analysis [28]. Indeed, following the voices of harsh criticism or complete rejection would imply radical changes to how research is conducted. As a result, we would be interfering intensely with other people's jobs, affecting a huge industry of asset allocation and the scientific process. The routine of conducting statistical analysis cannot change radically, since the alternatives have not been widely convincing [121].
Complementing p-values in order to make better decisions seems reasonable, and there are various approaches for doing so [122]. Reproducibility and replicability are probably reasonable complements as a central scientific goal [54,123]. A small revolution might be a consensus between scientific societies and major statistical software developers on specific solutions, such as the one illustrated in this work, that would be widely implemented and would, by default, be introduced into everyday practice. Textbook authors would follow along, and an unavoidable evolution would be reached.

Acknowledgments: The authors are grateful to Constantine Gatsonis for all the fruitful discussions on the topic and would like to thank the reviewers for their insightful comments.

Conflicts of Interest: The authors declare no conflict of interest.