# A Frequentist Alternative to Significance Testing, p-Values, and Confidence Intervals

## Abstract


## 1. A Frequentist Alternative to Significance Testing, p-Values, and Confidence Intervals

## 2. Discontent with Significance Testing, p-Values, and Confidence Intervals

#### 2.1. Significance Testing

#### 2.2. p-Values without Significance Testing

#### 2.3. Confidence Intervals

#### 2.4. Bayesian Thinking

## 3. The A Priori Procedure (APP)

- $f$ is the fraction of a standard deviation the researcher defines as sufficiently “close,”
- ${Z}_{C}$ is the z-score that corresponds to the desired probability of being close, and
- $n$ is the minimum sample size necessary to meet specifications for closeness and confidence.
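In the simplest (one-mean) case, these specifications combine as $n = \left(Z_C/f\right)^2$, rounded up to the nearest whole number (Trafimow 2017). A minimal sketch in Python (the function name is mine):

```python
from math import ceil
from statistics import NormalDist

def app_sample_size(f: float, confidence: float) -> int:
    """Minimum n needed for the sample mean to fall within f standard
    deviations of the population mean with the stated probability.
    Z_C is the two-sided z-score corresponding to that probability."""
    z_c = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil((z_c / f) ** 2)

# Within 0.1 standard deviations, with 95% confidence of being that close:
print(app_sample_size(f=0.1, confidence=0.95))  # 385
```

Tightening the closeness criterion $f$ raises $n$ quadratically, which is why modest-sounding precision demands can imply large samples.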

#### 3.1. $j$ Groups

- $j$ is the number of groups,
- $p\left(j\ \text{means}\right)$ is the probability of meeting the closeness specification with respect to the $j$ groups, and
- $\Phi^{-1}$ is the inverse of the cumulative distribution function of the standard normal distribution.
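As a sketch of how these quantities combine — assuming, as in Trafimow and MacDonald (2017), that the $j$ groups are independent, so each group must individually meet the closeness criterion with probability $p(j\ \text{means})^{1/j}$ — the per-group sample size can be computed as follows (function name mine):

```python
from math import ceil
from statistics import NormalDist

def app_sample_size_j_groups(f: float, p_j_means: float, j: int) -> int:
    """Per-group n so that all j sample means fall within f standard
    deviations of their population means with probability p_j_means.
    Assumes independent groups, so each group must be close with
    probability p_j_means ** (1/j)."""
    per_group_p = p_j_means ** (1 / j)
    z = NormalDist().inv_cdf((1 + per_group_p) / 2)  # Phi^{-1} step
    return ceil((z / f) ** 2)

# With one group, this reduces to the single-mean case:
print(app_sample_size_j_groups(f=0.1, p_j_means=0.95, j=1))  # 385
# More groups demand more participants per group:
print(app_sample_size_j_groups(f=0.1, p_j_means=0.95, j=2))
```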

#### 3.2. Differences in Means

#### 3.3. Skew-Normal Distributions

#### 3.4. Limitations

#### 3.5. APP versus Power Analysis

#### 3.6. The Relationship between the APP and Idealized Replication

#### 3.7. Criteria

## 4. Conclusions

## Funding

## Conflicts of Interest

## References

- Amrhein, Valentin, David Trafimow, and Sander Greenland. 2019. Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician 73: 262–70.
- Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E.-J. Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, and Colin Camerer. 2018. Redefine statistical significance. Nature Human Behaviour 2: 6–10.
- Berk, Richard A., and David A. Freedman. 2003. Statistical assumptions as empirical commitments. In Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger, 2nd ed. Edited by Thomas G. Blomberg and Stanley Cohen. New York: Aldine de Gruyter, pp. 235–54.
- Blanca, María J., Jaume Arnau, Dolores López-Montiel, Roser Bono, and Rebecca Bendayan. 2013. Skewness and kurtosis in real data samples. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences 9: 78–84.
- Box, George E. P., and Norman R. Draper. 1987. Empirical Model-Building and Response Surfaces. New York: John Wiley and Sons.
- Cartwright, Nancy. 1983. How the Laws of Physics Lie. Oxford: Oxford University Press.
- Carver, Ronald P. 1993. The case against statistical significance testing, revisited. Journal of Experimental Education 61: 287–92.
- Cohen, Jacob. 1994. The earth is round (p < 0.05). American Psychologist 49: 997–1003.
- Cumming, Geoff, and Robert Calin-Jageman. 2017. Introduction to the New Statistics: Estimation, Open Science, and Beyond. New York: Taylor and Francis Group.
- Fisher, Ronald A. 1973. Statistical Methods and Scientific Inference, 3rd ed. London: Collier Macmillan.
- Gillies, Donald. 2000. Philosophical Theories of Probability. London: Taylor and Francis.
- Good, Irving J. 1983. Good Thinking: The Foundations of Probability and Its Applications. Minneapolis: University of Minnesota Press.
- Greenland, Sander. 2019. The unconditional information in p-values, and its refutational interpretation via S-values. The American Statistician 73: 106–14.
- Grice, James W. 2017. Comment on Locascio’s results blind manuscript evaluation proposal. Basic and Applied Social Psychology 39: 254–55.
- Halsey, Lewis G., Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. 2015. The fickle P value generates irreproducible results. Nature Methods 12: 179–85.
- Ho, Andrew D., and Carol C. Yu. 2015. Descriptive statistics for modern test score distributions: Skewness, kurtosis, discreteness, and ceiling effects. Educational and Psychological Measurement 75: 365–88.
- Hyman, Michael. 2017. Can ‘results blind manuscript evaluation’ assuage ‘publication bias’? Basic and Applied Social Psychology 39: 247–51.
- Kim, Jae H., and Philip I. Ji. 2015. Significance testing in empirical finance: A critical review and empirical assessment. Journal of Empirical Finance 34: 1–14.
- Kline, Rex. 2017. Comment on Locascio, results blind science publishing. Basic and Applied Social Psychology 39: 256–57.
- Locascio, Joseph. 2017a. Results blind publishing. Basic and Applied Social Psychology 39: 239–46.
- Locascio, Joseph. 2017b. Rejoinder to responses to “results blind publishing”. Basic and Applied Social Psychology 39: 258–61.
- Marks, Michael J. 2017. Commentary on Locascio. Basic and Applied Social Psychology 39: 252–53.
- Melton, Arthur. 1962. Editorial. Journal of Experimental Psychology 64: 553–57.
- Micceri, Theodore. 1989. The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105: 156–66.
- Michelson, Albert A., and Edward W. Morley. 1887. On the relative motion of the Earth and the luminiferous ether. American Journal of Science, Third Series 34: 333–45.
- Nickerson, Raymond S. 2000. Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods 5: 241–301.
- Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349: aac4716.
- Paolella, Marc S. 2018. Fundamental Statistical Inference: A Computational Approach. Chichester: John Wiley and Sons.
- Trafimow, David. 2003. Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes’s theorem. Psychological Review 110: 526–35.
- Trafimow, David. 2005. The ubiquitous Laplacian assumption: Reply to Lee and Wagenmakers. Psychological Review 112: 669–74.
- Trafimow, David. 2017. Using the coefficient of confidence to make the philosophical switch from a posteriori to a priori inferential statistics. Educational and Psychological Measurement 77: 831–54.
- Trafimow, David. 2018a. An a priori solution to the replication crisis. Philosophical Psychology 31: 1188–214.
- Trafimow, David. 2018b. Confidence intervals, precision and confounding. New Ideas in Psychology 50: 48–53.
- Trafimow, David. 2019a. My ban on null hypothesis significance testing and confidence intervals. In Structural Changes and Their Economic Modeling. Edited by Vladik Kreinovich and Songsak Sriboonchitta. Cham: Springer, pp. 35–48.
- Trafimow, David. 2019b. What to do instead of null hypothesis significance testing or confidence intervals. In Beyond Traditional Probabilistic Methods in Econometrics. Edited by Vladik Kreinovich, Nguyen Ngoc Thach, Nguyen Duc Trung and Dang Van Thanh. Cham: Springer, pp. 113–28.
- Trafimow, David. Forthcoming. A taxonomy of model assumptions on which P is based and implications for added benefit in the sciences. International Journal of Social Research Methodology.
- Trafimow, David, and Michiel de Boer. 2018. Measuring the strength of the evidence. Biomedical Journal of Scientific and Technical Research 6: 1–7.
- Trafimow, David, and Brian D. Earp. 2017. Null hypothesis significance testing and Type I error: The domain problem. New Ideas in Psychology 45: 19–27.
- Trafimow, David, and Justin A. MacDonald. 2017. Performing inferential statistics prior to data collection. Educational and Psychological Measurement 77: 204–19.
- Trafimow, David, and Hunter A. Myüz. Forthcoming. The sampling precision of research in five major areas of psychology. Behavior Research Methods.
- Trafimow, David, and Stephen Rice. 2009. What if social scientists had reviewed great scientific works of the past? Perspectives in Psychological Science 4: 65–78.
- Trafimow, David, Valentin Amrhein, Corson N. Areshenkoff, Carlos J. Barrera-Causil, Eric J. Beh, Yusef K. Bilgic, Roser Bono, Michael T. Bradley, William M. Briggs, Héctor A. Cepeda-Freyre, and et al. 2018. Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology 9: 699.
- Trafimow, David, Tonghui Wang, and Cong Wang. 2019. From a sampling precision perspective, skewness is a friend and not an enemy! Educational and Psychological Measurement 79: 129–50.
- Trafimow, David, Cong Wang, and Tonghui Wang. Forthcoming. Making the a priori procedure (APP) work for differences between means. Educational and Psychological Measurement.
- Wang, Cong, Tonghui Wang, David Trafimow, and Hunter A. Myüz. 2019a. Desired sample size for estimating the skewness under skew normal settings. In Structural Changes and Their Economic Modeling. Edited by Vladik Kreinovich and Songsak Sriboonchitta. Cham: Springer, pp. 152–62.
- Wang, Cong, Tonghui Wang, David Trafimow, and Xiaoting Zhang. 2019b. Necessary sample size for estimating the scale parameter with specified closeness and confidence. International Journal of Intelligent Technologies and Applied Statistics 12: 17–29.
- Wasserstein, Ronald L., and Nicole A. Lazar. 2016. The ASA’s statement on p-values: Context, process, and purpose. The American Statistician 70: 129–33.

1 | This is an oversimplification. In fact, the p-value is computed from a whole model, which includes the null hypothesis as well as countless inferential assumptions. That the whole model is involved in computing a p-value will be addressed carefully later. For now, we need not consider the whole model to bring out the logical issue at play. |

2 | Also see Kim and Ji (2015) for Bayesian calculations pertaining to significance tests in empirical finance. |

3 | The interested reader should consult the larger discussion of the issue in the pages of Basic and Applied Social Psychology (Grice 2017; Hyman 2017; Kline 2017; Locascio 2017a, 2017b; Marks 2017). |

4 | As will become clear later, this is not a valuable exercise; but the pretense is nevertheless useful to make an important point about logarithmic transformations of p-values. |
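For a concrete instance of such a transformation: Greenland (2019, in the references) proposes the S-value, $s = -\log_2 p$, which re-expresses a p-value as bits of refutational information against the model. A minimal sketch (the function name is mine):

```python
from math import log2

def s_value(p: float) -> float:
    """Greenland's (2019) S-value: a p-value re-expressed as bits
    of information against the model, s = -log2(p)."""
    return -log2(p)

print(s_value(0.05))   # about 4.32 bits
print(s_value(0.005))  # about 7.64 bits
```

On this scale, p = 0.05 carries no more evidential weight than seeing four heads in a row from a fair coin — a little over four bits.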

5 | Sometimes p-value apologists admit p-value unreliability but point out that such unreliability has been known from the start. Although this contention is correct, it fails to justify p-values. That a procedure has been known from the start to be unreliable does not justify its use! |

6 | This correlation, rounded to 0.004, was mentioned by Trafimow and de Boer (2018); but these researchers did not assess transformed p-values. |

7 | To understand why, consider that CIs are based largely on the standard error. In turn, the standard error is based on the standard deviation and the sample size. Finally, the standard deviation is influenced by random measurement error but also by systematic differences between people. Thus, the standard deviation in the numerator of the standard error calculation is influenced by both measurement precision and precision of homogeneity; and the denominator of the standard error includes the sample size, thereby implicating the importance of sampling precision. Thus, all three types of precision influence the standard error. This triple confound is problematic for interpreting CIs. |
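A toy calculation makes the triple confound concrete: because the standard error is $s/\sqrt{n}$, a homogeneous small sample and a much more heterogeneous large sample can yield identical standard errors, and hence equally wide CIs (the numbers below are invented for illustration):

```python
from math import sqrt

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / sqrt(n)

# Homogeneous sample of 25 vs. a sample four times larger with
# twice the standard deviation: identical standard errors, so a CI
# cannot tell us which kind of precision produced its width.
print(standard_error(sd=5.0, n=25))    # 1.0
print(standard_error(sd=10.0, n=100))  # 1.0
```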

8 | |

9 | Because participants do not come in fractions, it is customary to round upwards to the nearest whole number. |

10 | I thank an anonymous reviewer for pointing out that, due to the lack of cutoff points, this limitation can be considered a strength. According to the reviewer, “There is no cutoff point, so potentially all estimates could be viable.” |

11 | For those who prefer CIs, an alternative goal would be to find the number of participants required to obtain sample CIs of desired widths. |

12 | For elaborated mathematical discussions of the differences, see Trafimow and Myüz (Forthcoming) and Trafimow (2019b). |

13 | See Trafimow and Myüz (Forthcoming) for details. |

14 | Michelson received his Nobel Prize in 1907. |

15 | It is interesting that Carver (1993) reanalyzed the data using NHST and obtained a statistically significant effect due to the large number of data points. As Carver pointed out, had Michelson and Morley used NHST, the existence of the luminiferous ether would have been supported, with incalculable consequences for physics (also see Trafimow and Rice 2009). |

16 | A counter might be to use equivalence testing; but this is extremely problematic because it involves the computation of at least two p-values, whereas we already have seen that even one p-value is problematic. |

17 | If specifications are not met in one of the two studies, that constitutes a failure to replicate. |

18 | I thank an anonymous reviewer for suggesting this issue. |

19 |

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Trafimow, D. A Frequentist Alternative to Significance Testing, *p*-Values, and Confidence Intervals. *Econometrics* **2019**, *7*, 26.
https://doi.org/10.3390/econometrics7020026
