1. Introduction
In scientific research, the p-value has long been the gatekeeper of statistical significance, distinguishing signal from noise. Traditionally, a p-value less than 0.05 has been the threshold for deeming results significant. Yet this binary approach often fails to consider the nuances of statistical power and the potential for large sample sizes to show statistical significance despite trivial treatment effects [1]. The arbitrary, fixed p-value threshold of 0.05 is thought to be a significant contributor to the reproducibility crisis in medicine [2].
To address these reproducibility concerns, researchers have developed various measures of statistical fragility for 2 × 2 contingency tables, such as the unit fragility index (UFI) [3], fragility index (FI) [4], fragility quotient (FQ) [5], and percent fragility index (pFI) [6]. These measures aim to quantify the robustness of research findings by assessing the impact of small changes in the data on statistical significance. However, all of these measures require modifying the underlying distribution of the raw data: they measure how stable a study's significance status at the 0.05 threshold is to alterations in the distribution of the research data. While the UFI and FI both increase with increasing sample size, the pFI and FQ quantify statistical fragility while accounting for the sample size.
The robustness index (RI) takes a different approach to statistical fragility [7]. It, too, examines the stability of significance at the 0.05 threshold. However, unlike the UFI, FI, FQ, and pFI, it maintains the integrity of the distribution of the underlying data. The RI does not require any manipulation of the data distribution but instead examines changes in the p-value as the sample size increases or decreases. The distribution of the study outcomes remains fixed; only the sample size changes. The RI thus allows for a more standardized comparison of research studies with different sample sizes. By maintaining the integrity of the raw data distributions rather than artificially manipulating the data, the RI can fill a critical gap in the current understanding of statistical fragility and improve the reproducibility of medical research.
This perspective article demonstrates how each primary statistical fragility metric is calculated. Two case examples show how these various fragility metrics are applied. Finally, the clinical application and importance of the fragility metrics are discussed.
3. Clinical Interpretation of Fragility Metrics
There are currently no widely agreed-upon cutoff values for any of these fragility metrics to categorize statistical findings as fragile or robust. Given the high correlation of the fragility indices with the p-value [15], there is ongoing debate about whether these fragility metrics are valuable [16]. Nevertheless, from an intuitive standpoint, the fragility metrics can help us understand the data better. For example, an RI of 3 indicates that if the sample size were increased by a factor of 3, a nonsignificant finding would become significant. Similarly, the pFI is readily understandable; for instance, a pFI of 4% indicates that a 4% change in the outcomes in any direction (e.g., due to random events, miscategorization, or loss to follow-up) could potentially flip the statistical significance.
3.1. Interpretation of the Fragility Indices
Although no specific cutoff values have been established, it is generally considered that when the FI or UFI is less than the number of research subjects lost to follow-up, the findings are fragile and should be interpreted cautiously [17].
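To make the index concrete, the FI idea can be sketched in a few lines of Python. This is a minimal sketch of the commonly described procedure (flipping outcomes one at a time in the intervention group and recomputing Fisher's exact test until significance is lost), not the original published implementation; the function name, the choice of which group is modified, and the direction of flipping are illustrative assumptions.

```python
from scipy.stats import fisher_exact

def fragility_index(events_a, total_a, events_b, total_b, alpha=0.05):
    """Sketch of the FI: the smallest number of outcome flips in group A
    that turns a significant Fisher's exact test nonsignificant."""
    p = fisher_exact([[events_a, total_a - events_a],
                      [events_b, total_b - events_b]])[1]
    if p >= alpha:
        return None  # the FI is defined only for initially significant results

    # Flip outcomes in the direction that narrows the between-group difference.
    step = 1 if events_a / total_a < events_b / total_b else -1
    flips, e = 0, events_a
    while 0 <= e + step <= total_a:
        e += step
        flips += 1
        p = fisher_exact([[e, total_a - e],
                          [events_b, total_b - events_b]])[1]
        if p >= alpha:
            return flips
    return None
```

For a hypothetical trial with 1/100 events in one arm and 11/100 in the other (a significant difference), the function counts how many single-patient outcome changes are needed before the p-value crosses 0.05.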
3.2. Interpretation of the Fragility Quotients
Again, no established cutoff values exist for any of the various fragility quotients. By a logic similar to that underlying the 0.05 p-value threshold, a quotient of less than 5% could perhaps suggest statistical fragility. However, because the data manipulation involved in calculating these indices is unidirectional, a quotient of 3% or less may be more appropriate as an indicator of fragility. Given the non-uniform data manipulation of the FQ (only the outcomes of the intervention group are altered) and the coarse incrementation of both the FQ and UFQ, the pFI appears to be the most reproducible and most conservative quotient for assessing statistical fragility across a broad range of biomedical research applications.
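The published pFI algorithm is not reproduced in this excerpt, so the sketch below is a hypothetical operationalization of the description above: outcomes are shifted by equal fractional amounts in opposite directions in the two groups (a uniform, bidirectional manipulation, which is why a chi-square test rather than Fisher's exact test is used, since fractional counts are allowed), and the smallest shift that flips significance is reported as a percent of the total sample size. The function name, the 0.01 step size, and the omission of the Yates correction are all assumptions.

```python
from scipy.stats import chi2_contingency

def percent_fragility_index(events_a, total_a, events_b, total_b,
                            alpha=0.05, step=0.01):
    """Hypothetical sketch of the pFI: smallest uniform fractional shift
    of outcomes (as % of total n) that flips chi-square significance."""
    def p_value(shift):
        table = [[events_a + shift, total_a - events_a - shift],
                 [events_b - shift, total_b - events_b + shift]]
        if any(cell < 0 for row in table for cell in row):
            return None  # shift would produce impossible (negative) counts
        return chi2_contingency(table, correction=False)[1]

    initially_sig = p_value(0.0) < alpha
    n = total_a + total_b
    shift = step
    while shift <= min(total_a, total_b):
        for s in (shift, -shift):  # the shift may flip significance either way
            p = p_value(s)
            if p is not None and (p < alpha) != initially_sig:
                return 100.0 * shift / n
        shift += step
    return None
```

Because the shift is applied symmetrically to both groups, swapping the two groups leaves the result unchanged, which is one way to read the claim that the pFI avoids the FI's dependence on a designated intervention group.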
3.3. Interpretation of the Robustness Index
While no established cutoff values exist for the RI, it is suggested that an RI of 2 or less indicates statistical fragility, a value between 2 and 5 indicates intermediate fragility, and a value of 5 or greater is consistent with statistical robustness. Regardless of whether the original findings are significant or nonsignificant, the robustness of the findings increases as the RI increases. In all cases, a small RI suggests fragility and a large RI suggests robustness.
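The RI computation itself is not shown in this excerpt; the sketch below operationalizes the verbal description (fixed outcome distribution, varying sample size): all four cell counts are rescaled by a common factor until significance flips, and the factor at the flip is reported. The multiplicative search step and the reporting convention for initially significant results (the reciprocal of the shrink factor, so that a larger RI always means a more robust finding) are assumptions made for illustration.

```python
from scipy.stats import chi2_contingency

def robustness_index(table, alpha=0.05, step=0.01, max_factor=100.0):
    """Hypothetical sketch of the RI: rescale the sample size (keeping the
    outcome distribution fixed) until chi-square significance flips."""
    def p_value(k):
        scaled = [[cell * k for cell in row] for row in table]
        return chi2_contingency(scaled, correction=False)[1]

    initially_sig = p_value(1.0) < alpha
    k = 1.0
    while 1.0 / max_factor <= k <= max_factor:
        # Nonsignificant findings need a larger sample to flip; significant
        # findings need a smaller one.
        k = k / (1 + step) if initially_sig else k * (1 + step)
        if (p_value(k) < alpha) != initially_sig:
            # Report the factor so that larger always means more robust.
            return 1.0 / k if initially_sig else k
    return None
```

For example, a table of [[5, 95], [10, 90]] is nonsignificant; the sketch grows the sample until the p-value drops below 0.05 and returns the growth factor, matching the reading that an RI of 3 means tripling the sample would make a nonsignificant finding significant.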
4. Discussion
There is a reproducibility crisis in biomedical research, with seemingly important research findings discovered to be either false or questionable when verification studies are performed [18]. Multiple factors contribute to the irreproducibility of research. One factor is publication bias: researchers face pressure to publish positive and novel findings, leading to the selective reporting of results. Another is a lack of transparency in reporting study methods, data, and analyses, along with the unavailability of raw data [19]. However, even if these issues were fully addressed, reproducibility problems would persist due to inappropriate statistical analyses [20].
The American Statistical Association (ASA) released a statement in 2016 addressing the misuse and misinterpretation of p-values in scientific research [21]. The statement emphasizes that p-values alone do not provide a good measure of evidence regarding a model or hypothesis and that scientific conclusions should not be based solely on whether a p-value passes a specific threshold. The ASA also highlights the importance of proper inference, full reporting, and transparency in research. The concerns raised in the ASA statement align with the issues discussed in this manuscript, particularly the need for more comprehensive statistical tools to assess the robustness of research findings.
Including a fragility analysis as a routine statistical procedure could help address the reproducibility crisis and increase research efficacy. Small sample sizes do not always indicate poor research or low reproducibility, while studies with large sample sizes often detect minute population differences that are statistically significant but of no clinical value. The RI and the fragility quotients (in particular, the pFI) could help address these concerns at both ends of the sample-size spectrum.
Currently, the FI and FQ are the primary metrics used to evaluate the fragility or robustness of biomedical research. While the concept of the FI is intuitively easy to grasp, it has significant limitations. First, it depends upon a strict definition of an intervention group and a control group, so it cannot be consistently applied to comparison studies with no distinct intervention group; if the two groups are switched, the FI can change, making its results inconsistent. Furthermore, in some situations the FI simply cannot flip the statistical significance of a study, as demonstrated in the second case study above. Although the FI and FQ have become widely utilized, these insurmountable numerical challenges suggest that better fragility metrics are necessary. The RI, pFI, and UFI/UFQ, by contrast, appear more stable and more widely applicable. These metrics warrant further investigation, as does the development of additional tools to assess the fragility of statistical tests beyond the 2 × 2 contingency table.
The ASA’s 2016 statement on p-values underscores the need for alternative approaches to assess the strength of scientific evidence. The statement encourages the use of other methods, such as confidence intervals, Bayesian methods, and false discovery rates, which can provide more direct information about the size of an effect or the correctness of a hypothesis. The RI and pFI, as discussed in this manuscript, offer additional tools to evaluate the fragility of research findings and complement the recommendations made in the ASA statement.
The RI provides a distinct approach to statistical fragility compared with the fragility indices. First, it does not alter the distribution of the data; it assumes that the distribution found by the researchers is the best available estimate. All the other metrics of fragility manipulate the data in one direction only, implicitly assuming that any missing data carry a unidirectional rather than random bias. The RI makes no such assumption. It is a purely numerical process that utilizes the well-recognized fact that large sample sizes are much more likely to yield statistically significant findings, even when the differences are small and clinically meaningless. Nevertheless, a more rigorous and thorough evaluation of the RI, along with the other metrics of statistical fragility, is necessary.
It is unlikely that the p-value will be abandoned, in spite of numerous concerns regarding its arbitrary approach to defining significance. A p-value of 0.05 or less has become deeply embedded in the biomedical literature, including in the approval process for new pharmaceuticals [22]. Thus, it is recommended that, at a minimum, the statistical analyses in research studies include a fragility analysis using a standardized fragility metric that can be widely applied and compared across different types of studies, outcomes, and sample sizes.
While the ASA’s 2016 statement does not call for the complete abandonment of p-values, it does emphasize the need for researchers to recognize their limitations and to use them in conjunction with other statistical tools. The inclusion of fragility metrics in the statistical analysis of research studies aligns with the ASA’s recommendations for a more comprehensive and nuanced approach to statistical inference.
One alternative to the p-value is the relative risk index, which represents the residual of a contingency table divided by the sample size [15]. While this may ultimately be more applicable to clinicians deciding upon the management of individual patients, its novelty and lack of standardization currently make it unsuitable to replace the p-value.
In fields where the stakes of research outcomes are high, such as medicine or public policy, fragility metrics may become an invaluable component of the analytical arsenal. They allow decision-makers to discern between results genuinely indicative of an underlying phenomenon and those that may be statistical mirages. In doing so, fragility metrics underpin a more responsible form of data-driven decision-making that recognizes the complexity and conditional nature of statistical evidence. Fragility metrics invite scientists and practitioners to evaluate statistical findings more critically and to view research results with an appropriate but not excessive degree of skepticism.
5. Conclusions
Metrics of fragility, particularly the RI and pFI, are objective tools for assessing the statistical fragility of biomedical research. The pFI has an advantage over older fragility metrics because it is more precise and more widely applicable than the UFI and FI. It is also less biased than the FI because it is applied uniformly to all outcomes rather than selectively. In cases with no clear intervention group, the FI can yield inconsistent results, suggesting an underlying instability in the metric. While the FI, UFI, and pFI all perform a unidirectional shift in the outcome data, the RI does not alter the raw data but instead analyzes them by considering only the influence of sample size. Overall, current metrics of statistical fragility would benefit from further evaluation of how they perform and compare across a wide variety of research studies.
Incorporating fragility indices into research practices can be seen as a step toward a more mature phase of statistical reasoning, where significance is not just a matter of crossing a threshold but a multi-faceted and contextually informed judgment.