Review

Statistical Hypothesis Testing: A Comprehensive Review of Theory, Methods, and Applications

Faculty of Economics and Business, University of Belgrade, 11000 Belgrade, Serbia
Mathematics 2026, 14(2), 300; https://doi.org/10.3390/math14020300
Submission received: 21 December 2025 / Revised: 11 January 2026 / Accepted: 13 January 2026 / Published: 14 January 2026
(This article belongs to the Section D1: Probability and Statistics)

Abstract

Statistical hypothesis testing is a foundational concept in inferential statistics, providing formal mechanisms for evaluating evidence and making decisions. This review article provides a comprehensive overview of the principles of hypothesis testing, including classical frameworks such as the Neyman–Pearson paradigm and Fisher’s significance testing, as well as modern perspectives including Bayesian approaches, robust testing, and high-dimensional inference. The article synthesizes key theoretical underpinnings, outlines commonly used statistical tests, and discusses methodological challenges and criticisms. Applications across various fields—ranging from medicine and economics to machine learning and psychology—are reviewed to demonstrate the versatility and evolving role of hypothesis testing in contemporary research. Finally, current debates and future research directions are highlighted, emphasizing the need for more robust, transparent, and reproducible statistical practices.

1. Introduction

Statistical hypothesis testing plays a fundamental role in modern scientific research, serving as a central tool for drawing inferences from data. First introduced in the early 20th century by pioneers such as Ronald A. Fisher [1], Jerzy Neyman and Egon Pearson [2], hypothesis testing has since evolved into a critical component of inferential statistics. It enables researchers to formally assess whether observed data are consistent with a proposed model or whether deviations are statistically significant.
Statistical hypothesis testing has long been a cornerstone of empirical research, providing a structured framework to quantify uncertainty and draw conclusions from data. Classical approaches, such as Fisher’s significance testing and the Neyman–Pearson framework, laid the foundation for modern inferential statistics. However, with the advent of high-dimensional data, machine learning applications, and large-scale experimental studies, traditional methods face new challenges, including inflated false discovery rates, non-standard test statistics, and complex dependency structures.
This article aims to provide a comprehensive and integrative perspective on hypothesis testing by:
  • Synthesizing classical and modern approaches, including Fisherian, Neyman–Pearson, Bayesian, resampling, and high-dimensional methods.
  • Offering a critical comparison of methodologies, highlighting their theoretical strengths, limitations, and practical applicability in contemporary data-rich environments.
  • Bridging theory and application, with detailed discussions on high-dimensional mean testing, permutation and bootstrap techniques, and model-agnostic inference in machine learning.
  • Identifying open research questions and unresolved issues, particularly in the context of reproducibility, p-value controversies, and statistical inference in complex models.
Unlike traditional textbooks or standard survey articles, this work emphasizes critical synthesis, comparative evaluation, and integration of recent theoretical results, providing a roadmap for both researchers and practitioners navigating classical and emerging methodologies in statistical hypothesis testing.
The classical approach to hypothesis testing typically involves formulating a null hypothesis (H0), representing a default assumption or status quo, and an alternative hypothesis (H1), representing the presence of an effect or difference. By comparing test statistics derived from sample data to theoretical distributions, researchers can compute p-values and make decisions regarding the rejection or non-rejection of H0.
Despite its widespread adoption across disciplines such as medicine, economics, psychology, and engineering, hypothesis testing is not without criticism. Concerns related to the misuse of p-values, the replication crisis, and statistical power have prompted statisticians and scientists to re-examine foundational principles and propose alternative methods, including Bayesian approaches and robust nonparametric techniques.
This review article aims to provide a comprehensive and structured overview of hypothesis testing, including its theoretical foundations, commonly used methods, and practical applications. We also address current debates and emerging trends, particularly in high-dimensional data settings and interdisciplinary research. By synthesizing classical and modern perspectives, this paper intends to serve as a resource for both newcomers and experienced researchers seeking a deeper understanding of hypothesis testing in statistical inference.

2. History of Hypothesis Testing

Before hypothesis testing became formalized as a statistical method, researchers relied on intuition, experience, and methods that were more focused on descriptive statistics. At that time, statistics was mainly a tool for collecting and organizing data, without rigorous methods for testing theories and hypotheses.
The conceptual roots of hypothesis testing can be traced to early empirical procedures such as the Trial of the Pyx (judicial ceremony) in medieval England. In this annual examination, coins produced by the Royal Mint were randomly sampled and compared with the expected weight of precious metals. If significant discrepancies were observed, penalties followed. This procedure embodied the core logic of modern hypothesis testing: establishing a hypothesis, drawing a random sample, comparing observed outcomes with theoretical expectations, and making inferences based on deviations from the null assumption. The logical foundation of testing hypotheses, often viewed as a proof by contradiction, has deep philosophical origins.
Experiments attributed to Galileo challenged Aristotle’s hypothesis on falling bodies by providing empirical evidence inconsistent with theoretical expectations. This reasoning—attempting to falsify rather than prove a statement—anticipates the structure of modern significance testing.
Early statistical applications appeared in the eighteenth century. John Arbuthnot [3] analyzed birth records in London and argued that the consistent predominance of male births over female ones could not be explained by chance, proposing it as evidence of divine providence. Similarly, John Michell [4] examined the distribution of stars and concluded that their apparent clustering was unlikely under randomness, thereby introducing inferential reasoning in astronomy.
In the 19th century, several scientists applied probabilistic reasoning to data analysis, but formal methods of hypothesis testing did not yet exist.
Karl Pearson [5] developed basic statistical tools such as the chi-square test and correlation. His achievements were an important foundation for the further development of statistics. William Gosset [6] developed the t-test for small samples, providing one of the first exact tests of significance.
One of the most important moments in the history of hypothesis testing was the work of Ronald A. Fisher, an English statistician who pioneered the foundations of modern statistical testing. In the 1920s, Fisher [1] formulated several key principles in statistics:
  • Fisher’s probability method: Fisher introduced the concept of probability in the context of hypotheses, laying the foundation for today’s understanding of p-values. He recommended the use of the p-value as an indicator of the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed.
  • Fisher’s test: Basic statistical tests were developed, such as the F-test for analysis of variance, which enabled formalized testing of hypotheses about population parameters.
  • p-value: Fisher popularized the use of the p-value as a tool for deciding whether to reject or fail to reject the null hypothesis. He recommended the use of thresholds such as 0.05, although this threshold is the subject of debate and criticism today.
In 1928, Neyman and Pearson [2,7] developed a theoretical framework for hypothesis testing that laid the foundation for modern statistics. Their most important innovation was the formalization of hypothesis testing with the introduction of new terms:
  • Null hypothesis and alternative hypothesis: Neyman and Pearson emphasized the importance of using two competing hypotheses—the null hypothesis (H0) which asserts that there is no effect or difference, and the alternative hypothesis (H1) which assumes that there is an effect or difference.
  • Errors in hypothesis testing: They developed the concept of errors that can occur during hypothesis testing, recognizing two types:
    (1) Type I error (false positive): an error in which the null hypothesis is rejected even though it is true.
    (2) Type II error (false negative): an error in which the null hypothesis is not rejected even though it is false.
  • Critical region and alpha level: They suggested setting a critical region and a significance threshold (alpha level). If the test statistic falls within the critical region determined by the significance level α, the null hypothesis is rejected.
This framework, known as Neyman–Pearson theory, laid the foundation for modern hypothesis testing techniques (for more details see [8,9]). Following the work of Fisher and Neyman–Pearson, the statistical community began to adopt and develop hypothesis testing across disciplines. Hypothesis testing has become a standard tool in many research fields such as medicine, psychology, economics, engineering, and the social sciences, making it possible to draw more precise conclusions from research and experiments.

3. Null and Alternative Hypotheses

At the core of statistical hypothesis testing lies the formulation of two mutually exclusive statements:
  • The null hypothesis H0 typically represents a baseline assumption or status quo, such as “no difference” or “no effect.”
  • The alternative hypothesis H1 reflects the presence of an effect, difference, or association that the researcher aims to detect.
Formally, a hypothesis test is a decision procedure that evaluates the plausibility of H0 based on observed data. Depending on the research question, the general format of H0 and H1 is:
$$H_0:\ \theta \in \Theta_0 \qquad H_1:\ \theta \in \Theta_0^{C},$$
where $\theta$ is a population parameter, $\Theta_0$ is a subset of the parameter space, and $\Theta_0^{C}$ is its complement (see [10]). In hypothesis testing, after drawing a sample and computing a test statistic, we decide either to reject H0 or not to reject H0. The procedure partitions the sample space into two regions: the rejection region and the non-rejection region.

3.1. Types of Errors and Statistical Power

A hypothesis test can result in two kinds of errors:
  • Type I error (α): Incorrectly rejecting H0 when it is true (false positive). For $\theta \in \Theta_0$, the probability of a Type I error is $P_\theta(X \in R)$, where $R$ is the rejection region and $X$ is the sample (see [10]).
  • Type II error: Failing to reject H0 when H1 is true (false negative). For $\theta \in \Theta_0^{C}$, the probability of a Type II error is $P_\theta(X \in R^{C})$, where $R^{C}$ is the non-rejection region (see [10]).
The power function of a test is the function $\beta(\theta) = P_\theta(X \in R)$. The ideal power function is near 0 for $\theta \in \Theta_0$ and near 1 for $\theta \in \Theta_0^{C}$. The two types of errors are summarized in Table 1.

3.2. Methods of Finding Tests

A very general method is the likelihood ratio test. Let $X = (X_1, \ldots, X_n)$ be a random sample from a population with pdf or pmf $f(x \mid \theta)$. The likelihood function is
$$L(\theta \mid X_1, \ldots, X_n) = \prod_{i=1}^{n} f(X_i \mid \theta).$$
Definition [10]: The likelihood ratio test (LRT) statistic for testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^{c}$ is
$$\lambda(X) = \frac{\sup_{\Theta_0} L(\theta \mid X)}{\sup_{\Theta} L(\theta \mid X)}.$$
A likelihood ratio test is any test with a rejection region of the form $\{X : \lambda(X) \le c\}$, where $c$ satisfies $0 \le c \le 1$ and $\Theta$ denotes the entire parameter space.
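As a concrete illustration, consider testing $H_0: \theta = \theta_0$ in a $N(\theta, \sigma^2)$ model with known $\sigma$: the unrestricted MLE is the sample mean, and the LRT statistic reduces to $\lambda(x) = \exp\{-n(\bar{x}-\theta_0)^2 / (2\sigma^2)\}$. The following is a minimal numerical sketch (synthetic data and function names, not from the article):

```python
import numpy as np

# Illustrative sketch: LRT for H0: theta = theta0 in N(theta, sigma^2)
# with sigma known. The unrestricted MLE of theta is the sample mean,
# so the ratio has the closed form
#   lambda(x) = exp(-n * (xbar - theta0)^2 / (2 * sigma^2)).
def lrt_statistic(x, theta0, sigma):
    n = len(x)
    xbar = np.mean(x)
    # numerator: likelihood at theta0; denominator: likelihood at the MLE
    return np.exp(-n * (xbar - theta0) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=50)  # data drawn away from theta0
lam = lrt_statistic(x, theta0=0.0, sigma=1.0)
# A small lambda is evidence against H0 (reject when lambda <= c).
```

Note that $\lambda$ equals 1 exactly when $\theta_0$ coincides with the MLE, and decreases as the sample mean moves away from $\theta_0$.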

4. Classical Statistical Tests

Statistical hypothesis testing includes a variety of classical procedures, each designed for specific types of data and assumptions. This section presents the most commonly used parametric tests. Each test relies on theoretical distributions and assumptions such as normality, independence, and homogeneity of variance.

4.1. Z-Test

The z-test is used to determine whether a population mean significantly differs from some hypothetical value when the population standard deviation σ is known and the sample size is sufficiently large (or assuming a normal distribution). The null and the alternative hypothesis for the two-tailed test are
$$H_0:\ \mu = \mu_0 \qquad H_1:\ \mu \ne \mu_0.$$
Applying the LRT yields the rejection region (see [10])
$$|\bar{X}_n - \mu_0| \ge z_{\alpha/2} \frac{\sigma}{\sqrt{n}},$$
where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean and $z_{\alpha/2}$ satisfies $P(Z \ge z_{\alpha/2}) = \alpha/2$ with $Z \sim N(0,1)$. Figure 1 shows the rejection region.
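The two-sided z-test above can be sketched numerically as follows; data, sample size, and the helper name `z_test` are illustrative assumptions, not from the article:

```python
import numpy as np
from scipy import stats

# Hedged sketch of the two-sided z-test with known sigma (synthetic data).
def z_test(x, mu0, sigma, alpha=0.05):
    n = len(x)
    z = (np.mean(x) - mu0) / (sigma / np.sqrt(n))   # standardized sample mean
    z_crit = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2}
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))      # two-sided p-value
    return z, p_value, abs(z) >= z_crit             # reject H0?

rng = np.random.default_rng(1)
x = rng.normal(loc=10.4, scale=2.0, size=100)       # true mean differs from mu0
z, p, reject = z_test(x, mu0=10.0, sigma=2.0)
```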

4.2. Student’s t-Test

When σ is unknown, we use the t-test for testing statistical hypotheses about the population mean.

4.2.1. One Sample Problem

In case of one sample, it adjusts for the increased uncertainty by using the sample standard deviation and a t-distribution with n − 1 degrees of freedom. The null and the alternative hypothesis for the two-tailed test are (see [10]):
$$H_0:\ \mu = \mu_0 \qquad H_1:\ \mu \ne \mu_0.$$
Applying the LRT yields the rejection region
$$|\bar{X}_n - \mu_0| \ge t_{n-1;\alpha/2} \frac{S_n}{\sqrt{n}},$$
where $S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2$ is the sample variance and $t_{n-1;\alpha/2}$ satisfies $P(T \ge t_{n-1;\alpha/2}) = \alpha/2$ with $T \sim t_{n-1}$. See Figure 2.
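A minimal sketch of the one-sample t-test (synthetic data; the manual statistic is checked against `scipy.stats.ttest_1samp`, which implements the same formula):

```python
import numpy as np
from scipy import stats

# Illustrative one-sample t-test: manual statistic vs. scipy (synthetic data).
rng = np.random.default_rng(2)
x = rng.normal(loc=5.3, scale=1.5, size=25)
mu0 = 5.0

n = len(x)
# t = (xbar - mu0) / (S_n / sqrt(n)), with the ddof=1 sample standard deviation
t_manual = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))
t_scipy, p = stats.ttest_1samp(x, popmean=mu0)

alpha = 0.05
reject = abs(t_manual) >= stats.t.ppf(1 - alpha / 2, df=n - 1)
```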

4.2.2. Two Samples Problem

Let X 1 , …, X n 1 and Y 1 , …, Y n 2 be two samples from two normal distributions with unknown but equal variances. The null and the alternative hypothesis for the two-tailed test are
$$H_0:\ \mu_1 = \mu_2 \qquad H_1:\ \mu_1 \ne \mu_2.$$
Applying the LRT yields the rejection region (see [10])
$$\left| \frac{\bar{X}_{n_1} - \bar{Y}_{n_2}}{\sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \right| \ge t_{n_1+n_2-2;\alpha/2},$$
where $S_p^2$ is the pooled variance estimator and $t_{n_1+n_2-2;\alpha/2}$ satisfies $P(T \ge t_{n_1+n_2-2;\alpha/2}) = \alpha/2$ with $T \sim t_{n_1+n_2-2}$.
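A sketch of the pooled two-sample t-test (synthetic samples; the hand-computed statistic should agree with `scipy.stats.ttest_ind` under `equal_var=True`):

```python
import numpy as np
from scipy import stats

# Illustrative pooled two-sample t-test for the equal-variance case above.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=40)

n1, n2 = len(x), len(y)
# Pooled variance estimator S_p^2
sp2 = ((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1)) / (n1 + n2 - 2)
t_manual = (np.mean(x) - np.mean(y)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_scipy, p = stats.ttest_ind(x, y, equal_var=True)
```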

4.3. Chi-Square (χ2) Test

The χ2 test is used to determine whether a population variance significantly differs from some hypothetical value (assuming a normal distribution). The null and the alternative hypothesis for the two-tailed test are (see [11]):
$$H_0:\ \sigma^2 = \sigma_0^2 \qquad H_1:\ \sigma^2 \ne \sigma_0^2.$$
The test statistic
$$\chi^2 = \frac{(n-1) S^2}{\sigma_0^2}$$
has a $\chi^2$ distribution with $n-1$ degrees of freedom. Applying the LRT, we get the rejection regions $\chi^2 \ge \chi^2_{n-1;\alpha/2}$ and $\chi^2 \le \chi^2_{n-1;1-\alpha/2}$, which are shown in Figure 3.
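A minimal sketch of the two-sided variance test (synthetic data; note that the text's $\chi^2_{n-1;\alpha/2}$ denotes the upper-tail point, which corresponds to `chi2.ppf(1 - alpha/2, ...)` in scipy's lower-tail convention):

```python
import numpy as np
from scipy import stats

# Illustrative chi-square test for H0: sigma^2 = sigma0^2 (synthetic data).
rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=40)
sigma0_sq = 4.0
alpha = 0.05

n = len(x)
chi2_stat = (n - 1) * np.var(x, ddof=1) / sigma0_sq
lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # text's chi^2_{n-1; 1-alpha/2}
upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # text's chi^2_{n-1; alpha/2}
reject = bool(chi2_stat <= lower or chi2_stat >= upper)
```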

4.4. Analysis of Variance (ANOVA)

One-way ANOVA tests whether three or more group means differ significantly. The model is of the form (see [10]):
$$Y_{ij} = \theta_i + \varepsilon_{ij}, \quad i = 1, \ldots, k, \quad j = 1, \ldots, n_i,$$
where $k$ is the number of groups and $n_i$, $i = 1, \ldots, k$, are the sample sizes. The assumptions are:
(1) homoscedasticity (equality of variances; Levene’s test [12] is often used to verify it),
(2) the errors $\varepsilon_{ij}$ are i.i.d. with normal distribution and zero covariance,
(3) $E(\varepsilon_{ij}) = 0$, $\mathrm{Var}(\varepsilon_{ij}) = \sigma_i^2 < \infty$.
The null and the alternative hypotheses are
$$H_0:\ \theta_1 = \theta_2 = \cdots = \theta_k$$
$$H_1:\ \theta_i \ne \theta_j \ \text{for some } i, j.$$
The null and the alternative hypotheses can be written in another way (Theorem 11.2.5, [10]):
$$H_0:\ \sum_{i=1}^{k} a_i \theta_i = 0 \ \text{for all } a \in A$$
$$H_1:\ \sum_{i=1}^{k} a_i \theta_i \ne 0 \ \text{for some } a \in A,$$
where $A = \{a = (a_1, \ldots, a_k) : \sum_{i=1}^{k} a_i = 0\}$. We introduce some notation:
$$Y_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij}, \quad \bar{Y}_{i\cdot} = \frac{Y_{i\cdot}}{n_i}, \quad \bar{\bar{Y}} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}, \quad N = \sum_{i=1}^{k} n_i,$$
$$S_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2, \quad i = 1, \ldots, k,$$
$$S_p^2 = \frac{1}{N-k} \sum_{i=1}^{k} (n_i - 1) S_i^2 = \frac{1}{N-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2.$$
The test statistic
$$\frac{\sum_{i=1}^{k} a_i \bar{Y}_{i\cdot} - \sum_{i=1}^{k} a_i \theta_i}{\sqrt{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i}}$$
has Student’s distribution with $N-k$ degrees of freedom. Applying the LRT yields the rejection region at level $\alpha$
$$\left| \frac{\sum_{i=1}^{k} a_i \bar{Y}_{i\cdot} - \sum_{i=1}^{k} a_i \theta_i}{\sqrt{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i}} \right| > t_{N-k, \alpha/2}.$$
If we denote the absolute value of this statistic by $T_a$, i.e.,
$$T_a = \left| \frac{\sum_{i=1}^{k} a_i \bar{Y}_{i\cdot} - \sum_{i=1}^{k} a_i \theta_i}{\sqrt{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i}} \right|,$$
it follows from the union–intersection methodology that if we could reject for any $a$, we could reject for the $a$ that maximizes $T_a$. That is, we reject $H_0$ if ([10]):
$$\sup_a T_a > c, \quad \text{where} \quad P_{H_0}\!\left( \sup_a T_a > c \right) = \alpha.$$
Maximizing $T_a$ is equivalent to maximizing $T_a^2$. We have (see [10]):
$$T_a^2 = \frac{\left( \sum_{i=1}^{k} a_i \bar{Y}_{i\cdot} - \sum_{i=1}^{k} a_i \theta_i \right)^2}{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i} = \frac{\left( \sum_{i=1}^{k} a_i \bar{U}_i \right)^2}{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i}, \quad \bar{U}_i = \bar{Y}_{i\cdot} - \theta_i.$$
According to Theorem 11.2.8 [10] we have
$$\sup_{a : \sum a_i = 0} T_a^2 = \frac{\sum_{i=1}^{k} n_i \left( (\bar{Y}_{i\cdot} - \bar{\bar{Y}}) - (\theta_i - \bar{\theta}) \right)^2}{S_p^2},$$
where $\bar{\bar{Y}} = \sum n_i \bar{Y}_{i\cdot} / \sum n_i$ and $\bar{\theta} = \sum n_i \theta_i / \sum n_i$. Under the assumptions,
$$\frac{\sup_{a : \sum a_i = 0} T_a^2}{k-1}$$
has an $F$ distribution with $k-1$ and $N-k$ degrees of freedom. Finally, under $H_0$ we get the rejection region
$$\frac{\sum_{i=1}^{k} n_i (\bar{Y}_{i\cdot} - \bar{\bar{Y}})^2}{(k-1)\, S_p^2} > F_{k-1,\, N-k,\, \alpha},$$
which is shown in Figure 4.
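The resulting F-test can be illustrated numerically; a minimal sketch with synthetic groups, checking the hand-computed F statistic against `scipy.stats.f_oneway` (which computes the same ratio of between- to within-group mean squares):

```python
import numpy as np
from scipy import stats

# Illustrative one-way ANOVA F-test (synthetic, unequal group sizes).
rng = np.random.default_rng(5)
groups = [rng.normal(mu, 1.0, size=n) for mu, n in [(0.0, 20), (0.3, 25), (1.0, 22)]]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.mean(np.concatenate(groups))              # Y-double-bar
# Between-group sum of squares: sum_i n_i (Ybar_i - grand_mean)^2
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
# Pooled within-group variance S_p^2
sp2 = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups) / (N - k)
F_manual = (ss_between / (k - 1)) / sp2
F_scipy, p = stats.f_oneway(*groups)
```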

4.5. Hypothesis Testing in Regression Analysis

In the classical linear regression model we have ([11]):
Y = Xβ + ε,
where $Y \in \mathbb{R}^n$ is the vector of dependent observations, $X \in \mathbb{R}^{n \times p}$ is the matrix of independent observations, $\beta \in \mathbb{R}^p$ is the vector of unknown regression coefficients, and the errors ε are assumed to satisfy:
E[ε] = 0, Var[ε] = σ2Iₙ.
Under these assumptions, the ordinary least squares (OLS) estimator is given by ([11,13]):
$$\hat{\beta} = (X^T X)^{-1} X^T Y,$$
and it holds that
$$\hat{\beta} \sim N\!\left(\beta, \sigma^2 (X^T X)^{-1}\right).$$

4.5.1. Testing Individual Coefficients

To test the null hypothesis concerning a single regression parameter,
$$H_0:\ \beta_j = \beta_{j,0}$$
against the alternative $H_1:\ \beta_j \ne \beta_{j,0}$, we use the t-statistic
$$t_j = \frac{\hat{\beta}_j - \beta_{j,0}}{s(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_{j,0}}{\hat{\sigma} \sqrt{v_{jj}}},$$
where $v_{jj}$ denotes the $j$-th diagonal element of $(X^T X)^{-1}$ and
$$\hat{\sigma}^2 = \frac{\lVert Y - X\hat{\beta} \rVert^2}{n - p}.$$
Under $H_0$, $t_j$ follows Student’s t-distribution with $n-p$ degrees of freedom, i.e., $t_j \sim t_{n-p}$. The null hypothesis is rejected at significance level $\alpha$ if $|t_j| > t_{n-p,\, 1-\alpha/2}$.
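A minimal sketch of this coefficient t-test, computed directly from the OLS formulas above (the design matrix, coefficients, and index `j` are illustrative assumptions):

```python
import numpy as np
from scipy import stats

# Illustrative t-test for a single OLS coefficient (synthetic data).
rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 covariates
beta_true = np.array([1.0, 2.0, 0.0])
Y = X @ beta_true + rng.normal(scale=1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y               # OLS estimator
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)       # sigma-hat^2 = ||Y - X beta-hat||^2 / (n - p)

j = 1                                      # test H0: beta_1 = 0
t_j = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])
p_value = 2 * (1 - stats.t.cdf(abs(t_j), df=n - p))
```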

4.5.2. Joint Hypothesis Testing

More generally, we may test a linear restriction on parameters of the form
H0: Rβ = r,
where R is a q × p restriction matrix of rank q.
The test statistic is the F-statistic, given by ([11]):
$$F = \frac{(R\hat{\beta} - r)^T \left[ R (X^T X)^{-1} R^T \right]^{-1} (R\hat{\beta} - r) / q}{\hat{\sigma}^2}.$$
Under the null hypothesis, it holds that $F \sim F_{q,\, n-p}$. A common special case is testing the overall significance of the model:
$$H_0:\ \beta_1 = \beta_2 = \cdots = \beta_p = 0,$$
which corresponds to $R = [\, 0_{p \times 1} \mid I_p \,]$. In this case, the F-statistic reduces to
$$F = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p)},$$
where SSR is the regression sum of squares and SSE is the error sum of squares.
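The quadratic-form F-statistic above and the restricted-versus-full sum-of-squares form are algebraically identical; the following sketch checks this numerically on synthetic data (all names and values are illustrative):

```python
import numpy as np

# Illustrative joint F-test H0: R beta = r, verified two ways (synthetic data).
rng = np.random.default_rng(7)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 slopes
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)
p = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
sse_full = np.sum((Y - X @ beta_hat) ** 2)
sigma2_hat = sse_full / (n - p)

# H0: both slopes are zero -> R = [0 | I_2], r = 0, q = 2 restrictions
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r = np.zeros(2)
q = R.shape[0]
diff = R @ beta_hat - r
F_wald = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / q / sigma2_hat

# Equivalent form via restricted (intercept-only) vs. full residual sums of squares
sse_restricted = np.sum((Y - Y.mean()) ** 2)
F_sse = ((sse_restricted - sse_full) / q) / (sse_full / (n - p))
```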

4.5.3. Connection to Likelihood Ratio Tests

For Gaussian errors, the t- and F-tests coincide with the likelihood ratio tests (LRTs). The LRT statistic for testing H0: Rβ = r, is (see [13,14,15]):
$$\lambda = \frac{\max_{H_0} L(\beta, \sigma^2)}{\max_{H_1} L(\beta, \sigma^2)},$$
and it can be shown that
$$-2 \log \lambda = n \log\!\left( 1 + \frac{qF}{n-p} \right),$$
which is a monotonic function of $F$; hence both tests yield equivalent decisions under normality.

4.5.4. Extensions

In the presence of heteroscedasticity or autocorrelation, OLS-based tests lose validity. In such cases, robust standard errors or generalized least squares (GLS) are used. For nonlinear models, the Wald, Lagrange Multiplier (Score), and Likelihood Ratio tests provide asymptotically equivalent inference under regularity conditions.

4.6. p-Values and Significance

The p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. If p-value ≤ α, we reject H0, indicating statistically significant evidence against it. A small p-value does not measure the size or importance of an effect. It merely suggests incompatibility between the data and the null hypothesis.

4.6.1. Neyman–Pearson Framework vs. Fisher Testing

Two philosophical approaches have historically shaped hypothesis testing:
  • Fisher’s approach [1,16,17]: Introduced the concept of the p-value and emphasized significance as evidence against H0, without requiring an alternative.
  • Neyman–Pearson approach [2,18]: Two hypotheses are assumed to represent the whole population. Testing is a decision-making process between two hypotheses, with explicit control of error rates. For more details see Table 2.
Both approaches are still used today, often in hybrid form (for conceptual blurring between p-values and significance level see [19]).

4.6.2. Critical Comparison of Fisher and Neyman–Pearson Paradigms

Building on the formal distinctions outlined above, we now critically examine the conceptual and practical implications of these two paradigms. Statistical hypothesis testing is traditionally grounded in two distinct frameworks: Fisher’s significance testing and the Neyman–Pearson decision-theoretic paradigm. Although often combined in applied work, these approaches differ conceptually and mathematically, with important implications for modern statistical practice.
Fisher’s framework interprets the p-value as a continuous measure of evidence against the null hypothesis, defined as a tail probability under the null distribution. Its main advantage lies in its flexibility and evidential interpretation, making it useful in exploratory analyses. However, Fisher’s approach does not explicitly specify an alternative hypothesis nor control Type II error, limiting its suitability for formal decision-making and power optimization.
The Neyman–Pearson paradigm, by contrast, formulates hypothesis testing as a binary decision problem between competing hypotheses with pre-specified error rates. It provides a rigorous mathematical foundation through concepts such as rejection regions, power functions, and optimality results, including the Neyman–Pearson lemma. Nevertheless, its strict reliance on fixed significance levels and dichotomous conclusions can be restrictive in complex, data-driven settings.
In modern applications—such as high-dimensional inference, multiple testing, and machine learning—neither paradigm alone fully addresses challenges related to multiplicity, dependence, and model uncertainty. Consequently, contemporary practice increasingly adopts hybrid approaches, integrating classical testing with resampling methods, false discovery rate control, Bayesian inference, and estimation-based reporting.
The ongoing p-value controversy reflects these foundational tensions between evidential and decision-based perspectives. Alternatives such as confidence intervals, Bayes factors, and decision-theoretic criteria aim to provide richer and more transparent measures of statistical evidence. A clear understanding of the strengths and limitations of both paradigms remains essential for the appropriate application and further development of hypothesis testing methods.

5. Advanced and Alternative Methods in Hypothesis Testing

Classical hypothesis testing methods rely on specific assumptions such as normality, homogeneity, and large sample sizes. In practice, however, these assumptions are often violated. Furthermore, as data become more complex and high-dimensional, traditional methods can lose accuracy or interpretability. This section presents several advanced and alternative methods that aim to overcome the limitations of classical testing.

5.1. Bayesian Hypothesis Testing

In Bayesian hypothesis testing, we start with a prior—our initial belief about the plausibility of a hypothesis before observing any data. When new data become available, we evaluate the likelihood, which tells us how probable the observed data is under each hypothesis. Using Bayes’ theorem, we then update our belief and obtain the posterior, which reflects the revised probability of the hypothesis after incorporating the data.
Unlike the frequentist approach, which assesses the probability of data given a hypothesis, Bayesian inference evaluates the probability of a hypothesis given the data. Formally, assume that
$$P(H_0) = p_0 \quad \text{and} \quad P(H_1) = 1 - p_0.$$
Suppose we have a random sample $X = (X_1, \ldots, X_n)$ and we know the distributions $f_X(x \mid H_0)$ and $f_X(x \mid H_1)$. According to Bayes’ theorem, the posterior probabilities of the hypotheses are (see [20,21,22]):
$$P(H_0 \mid X = x) = \frac{f_X(x \mid H_0)\, P(H_0)}{f_X(x)} \quad \text{and} \quad P(H_1 \mid X = x) = \frac{f_X(x \mid H_1)\, P(H_1)}{f_X(x)}.$$
The maximum a posteriori test is based on the following idea: we do not reject $H_0$ if and only if
$$P(H_0 \mid X = x) \ge P(H_1 \mid X = x),$$
or equivalently,
$$f_X(x \mid H_0)\, P(H_0) \ge f_X(x \mid H_1)\, P(H_1).$$
Figure 5 shows the prior distribution for the hypotheses and the likelihood function, which indicates the support of the observed data for each hypothesis. The posterior distribution is obtained by updating the prior with the likelihood.
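The maximum a posteriori rule above can be sketched for two simple hypotheses about a normal mean; the priors, hypothesized means, and data below are illustrative assumptions:

```python
import numpy as np
from scipy import stats

# Illustrative MAP test for two simple hypotheses: H0: mu = 0 vs H1: mu = 1,
# with a N(mu, 1) model (synthetic data actually drawn under H1).
p0 = 0.5                                       # prior P(H0); P(H1) = 1 - p0
rng = np.random.default_rng(8)
x = rng.normal(loc=1.0, scale=1.0, size=20)

# Likelihood of the whole sample under each simple hypothesis
lik_h0 = np.prod(stats.norm.pdf(x, loc=0.0, scale=1.0))
lik_h1 = np.prod(stats.norm.pdf(x, loc=1.0, scale=1.0))

evidence = lik_h0 * p0 + lik_h1 * (1 - p0)     # marginal f_X(x)
post_h0 = lik_h0 * p0 / evidence
post_h1 = lik_h1 * (1 - p0) / evidence
reject_h0 = post_h0 < post_h1                  # MAP decision rule
```

The posterior probabilities necessarily sum to one, and the MAP decision depends only on the comparison of the prior-weighted likelihoods.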
Beyond formal Bayesian hypothesis tests based on posterior probabilities, Bayesian inference also offers important advantages from a decision-theoretic and estimation perspective. Bayesian approaches provide a conceptually distinct framework for hypothesis testing and uncertainty quantification. Unlike frequentist methods, which rely on asymptotic approximations and long-run error rates, Bayesian inference explicitly models uncertainty about parameters through prior distributions. This allows for regularization and shrinkage in finite samples, which can stabilize estimation in settings with limited data or weak identification. For instance, consider a parametric model with parameter vector θ and sample size n. If the prior mean $\mu_0$ is chosen such that $\lVert \mu_0 - \theta_0 \rVert = O(\sqrt{p/n})$ (here $\theta_0$ is the true, unknown parameter value in the model), the posterior mean may achieve lower quadratic risk than the maximum likelihood estimator due to reduced variance, at the cost of a controlled bias. As n grows, Bayesian and frequentist estimators converge asymptotically, but in small or moderate samples the Bayesian framework can offer more reliable uncertainty quantification.
Furthermore, credible intervals derived from the posterior are not merely Bayesian analogues of confidence intervals; they represent a mechanism for propagating parameter uncertainty in prediction and decision-making. This feature is particularly valuable in modern applications, such as high-dimensional models or machine learning pipelines, where classical asymptotic approximations may be inaccurate. By explicitly incorporating prior information, Bayesian methods can clarify identifying assumptions and provide uncertainty-aware inference, complementing frequentist hypothesis testing and supporting more robust decision-making under limited data conditions.

5.2. Nonparametric Tests

When data violate parametric assumptions, nonparametric tests provide a robust alternative. These methods do not rely on specific distributional assumptions and are often based on ranks rather than raw values. Nonparametric tests are especially useful in small-sample studies or when data are ordinal or heavily skewed.

5.2.1. Goodness-of-Fit Test ( χ 2 Test)

Let $X = (X_1, \ldots, X_n)$ be a random sample from a population with unknown cdf $F(x)$. The null and the alternative hypotheses are (see, for example, [23]):
$$H_0:\ F(x) = F_0(x) \ \text{for every } x \in \mathbb{R},$$
$$H_1:\ F(x) \ne F_0(x) \ \text{for at least one } x \in \mathbb{R}.$$
The test statistic is of the form
$$\chi^2 = \sum_{k=1}^{r} \frac{(M_k - n p_k)^2}{n p_k},$$
where $\mathbb{R}$ is divided into $r$ mutually disjoint subsets $S_1, \ldots, S_r$. Here, $M_k$, $k = 1, \ldots, r$, is the number of observations $X_i$, $i = 1, \ldots, n$, that fall in $S_k$, and $p_k = P(X \in S_k)$ under $H_0$. The rejection region is of the form
$$\{ X : \chi^2 \ge \chi^2_{r-1;\alpha} \}.$$
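A minimal sketch of the goodness-of-fit test for a fair six-sided die (the counts below are illustrative, not from the article; `scipy.stats.chisquare` computes the same statistic):

```python
import numpy as np
from scipy import stats

# Illustrative chi-square goodness-of-fit test: are die rolls uniform?
observed = np.array([18, 22, 16, 25, 19, 20])   # cell counts M_k, n = 120
n = observed.sum()
p_k = np.full(6, 1 / 6)                         # cell probabilities under H0
expected = n * p_k                              # n * p_k

chi2_manual = np.sum((observed - expected) ** 2 / expected)
chi2_scipy, p_value = stats.chisquare(observed, expected)
# Reject H0 at level alpha = 0.05 if chi2 >= chi^2_{r-1; alpha}
reject = bool(chi2_manual >= stats.chi2.ppf(0.95, df=len(observed) - 1))
```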

5.2.2. Kolmogorov–Smirnov Test (KS Test)

The Kolmogorov–Smirnov test is a nonparametric statistical test that compares empirical distribution functions. It measures the maximum distance between two cumulative distribution functions (CDFs) to determine whether a sample follows a specified distribution (one-sample KS test) or whether two samples come from the same distribution (two-sample KS test).
The one-sample KS test assesses whether a sample comes from a specified continuous distribution. The null and the alternative hypotheses are (see [23]):
H0: the data come from the theoretical distribution F(x);
H1: the data do not come from F(x).
Given a sample of n independent observations X 1 , …, X n , we define the empirical cumulative distribution function as:
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \le x\},$$
where $I$ is the indicator function. The KS statistic is:
$$D_n = \sup_x |F_n(x) - F(x)|.$$
Under the null hypothesis (assuming a fully specified continuous distribution), the distribution of $D_n$ is known. The rejection region is $\{D_n > d\}$, where the critical value $d$ is taken from KS tables. For large $n$, the test statistic can be compared to critical values from the Kolmogorov distribution:
$$P(D_n > d) \approx Q_{KS}(\sqrt{n}\, d),$$
where $Q_{KS}(x) = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}$. A two-sample KS test assesses whether two independent samples are drawn from the same (unspecified) continuous distribution. The null and the alternative hypotheses are
H0: the two samples come from the same continuous distribution;
H1: the two samples come from different distributions.
Given two samples X 1 , …, X n 1 of size n 1 and Y 1 , …, Y n 2 of size n 2 , the empirical CDFs are:
$$F_{n_1}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} I\{X_i \le x\} \quad \text{and} \quad G_{n_2}(x) = \frac{1}{n_2} \sum_{j=1}^{n_2} I\{Y_j \le x\}.$$
The two-sample KS statistic is [23]:
$$D_{n_1, n_2} = \sup_x |F_{n_1}(x) - G_{n_2}(x)|.$$
The rejection region is $\{D_{n_1, n_2} > d\}$. For large sample sizes, the distribution of the test statistic under $H_0$ can be approximated using the scaling
$$\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, D_{n_1, n_2} \xrightarrow{d} \text{Kolmogorov distribution}.$$
In the one-sample test, the distribution F(x) must be fully specified (i.e., its parameters known). The KS test is:
  • less powerful than parametric tests when their assumptions are met;
  • not optimal for comparing discrete distributions unless modifications are made.
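Both variants can be sketched with scipy (synthetic samples; the manual one-sample statistic recomputes the supremum over the ECDF jump points, which is how the KS statistic is evaluated in practice):

```python
import numpy as np
from scipy import stats

# Illustrative one- and two-sample KS tests (synthetic data).
rng = np.random.default_rng(9)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=200)

# One-sample: H0 fully specifies F(x) = N(0, 1)
D_n, p_one = stats.kstest(x, "norm", args=(0.0, 1.0))

# Manual D_n: sup over the ECDF jump points (no ties for continuous data)
xs = np.sort(x)
F = stats.norm.cdf(xs)
i = np.arange(1, len(xs) + 1)
D_manual = max(np.max(i / len(xs) - F), np.max(F - (i - 1) / len(xs)))

# Two-sample: H0: both samples share the same continuous distribution
D_12, p_two = stats.ks_2samp(x, y)
```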

5.2.3. Mann–Whitney U Test: For Independent Samples (Alternative to Two-Sample t-Test)

The Mann–Whitney U test is a nonparametric alternative to the two-sample t-test. It is used to assess whether two independent samples come from the same distribution. This test ranks all data points from both groups together and then compares the sum of ranks between the groups. It is particularly useful when the assumption of normality is violated or when the data are at ordinal scale. Suppose we have two samples: X 1 , …, X n 1 and Y 1 , …, Y n 2 .
Combine the samples and rank all observations. Let R i be the rank of the i-th observation. The U statistic is defined as ([24]):
$$U_X = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_X, \quad \text{where} \quad R_X = \sum_{i=1}^{n_1} \mathrm{rank}(X_i).$$
The test statistic $U$ is:
$$U = \min(U_X, U_Y).$$
Under the null hypothesis $H_0: F_X = F_Y$, the $U$ statistic is approximately normally distributed for large samples.
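A sketch with synthetic data follows. Note a convention difference: the formulation above gives $U_X = n_1 n_2 - U_1$, where $U_1 = R_X - n_1(n_1+1)/2$ is the statistic that recent versions of `scipy.stats.mannwhitneyu` report for the first sample; the two are linked by $U_1 + U_2 = n_1 n_2$.

```python
import numpy as np
from scipy import stats

# Illustrative Mann-Whitney U test (synthetic data, no ties).
rng = np.random.default_rng(10)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.8, 1.0, size=35)

n1, n2 = len(x), len(y)
ranks = stats.rankdata(np.concatenate([x, y]))  # joint ranking of both samples
R_x = ranks[:n1].sum()                          # rank sum of the first sample
U_x = n1 * n2 + n1 * (n1 + 1) / 2 - R_x         # the article's U_X

U_scipy, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
# scipy (>= 1.7) reports U1 = R_x - n1(n1+1)/2, so U_x + U_scipy = n1 * n2.
```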

5.2.4. Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the nonparametric counterpart of the paired t-test. It is used for comparing two related or matched samples. The test is based on the ranks of the absolute differences between paired observations, considering the direction of change (positive or negative). It assumes that the differences are symmetrically distributed around the median.
This test is used for paired data $(X_i, Y_i)$ or for a single sample of paired differences (see [25]):
$$D_i = X_i - Y_i, \quad i = 1, 2, \ldots, n.$$
Discard pairs with $D_i = 0$, then rank the absolute differences $|D_i|$. Let $R_i$ be the rank of $|D_i|$ and let $\operatorname{sign}(D_i)$ be the sign of each difference. The test statistic is:
$$W = \sum_{i=1}^{n} \operatorname{sign}(D_i) \cdot R_i.$$
Often the test uses the sum of positive ranks $W^+$ or the minimum of $W^+$ and $W^-$. Under
$$H_0: \operatorname{median}(D_i) = 0,$$
the distribution of W is known exactly or can be approximated by the normal distribution for large n.
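A minimal sketch of the signed-rank statistic W for paired data follows; `scipy.stats.wilcoxon` implements the standard variants based on $W^+$.

```python
def midranks(values):
    """1-based average ranks; tied observations share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def wilcoxon_w(x, y):
    """Signed-rank statistic W = sum_i sign(D_i) * R_i for paired data.

    Pairs with zero difference are discarded; ranks are assigned to
    the absolute differences |D_i| (ties receive midranks).
    """
    d = [a - b for a, b in zip(x, y) if a != b]   # discard D_i = 0
    r = midranks([abs(v) for v in d])
    return sum((1 if v > 0 else -1) * ri for v, ri in zip(d, r))
```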

5.2.5. Kruskal–Wallis Test

The Kruskal–Wallis test generalizes the Mann–Whitney U test to more than two independent groups. It serves as a nonparametric alternative to one-way ANOVA. The test evaluates whether there is a statistically significant difference between the distributions of three or more groups by analyzing the ranks of the combined data across all groups. It assumes that the observations are independent and come from distributions with the same shape.
Suppose we have k groups with sizes $n_1, \ldots, n_k$ and $N = \sum_{i=1}^{k} n_i$. Rank all observations from all groups together. Let $R_j$ be the sum of ranks in group j. The test statistic is (see [26]):
$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1).$$
Under H0 (all samples come from identical distributions), H approximately follows a $\chi^2$ distribution with k − 1 degrees of freedom.
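The H statistic can be sketched in a few lines (no tie correction, which `scipy.stats.kruskal` applies in addition):

```python
def midranks(values):
    """1-based average ranks; tied observations share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def kruskal_wallis_h(*groups):
    """H = 12/(N(N+1)) * sum_j R_j^2 / n_j - 3(N+1)."""
    pooled = [v for g in groups for v in g]
    N = len(pooled)
    r = midranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        n_j = len(g)
        r_j = sum(r[start:start + n_j])   # rank sum of group j
        total += r_j ** 2 / n_j
        start += n_j
    return 12 / (N * (N + 1)) * total - 3 * (N + 1)
```

Identical constant groups give H = 0, the no-difference extreme.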

5.2.6. Friedman Test

The Friedman test is the nonparametric analog of repeated-measures ANOVA. It is used to detect differences in treatments across multiple test attempts when the same subjects are measured under different conditions. Data are ranked within each subject (row), and the test evaluates whether the rank sums differ significantly across conditions (columns).
This test is used for randomized block designs with k conditions and n subjects. Let $R_j$ be the sum of ranks for condition j. The test statistic is (see [27]):
$$Q = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1).$$
Under H0 (all treatment effects are equal), the test statistic Q approximately follows a $\chi^2$ distribution with k − 1 degrees of freedom (for sufficiently large n).

5.3. Resampling-Based Methods

Permutation tests and bootstrap methods are powerful tools that use resampling to estimate the sampling distribution of a test statistic. These methods are particularly valuable when dealing with non-standard test statistics or small sample sizes. It is important to emphasize that in permutation tests, sampling is performed without replacement, whereas in bootstrap methods, sampling is performed with replacement.

5.3.1. Permutation Tests

A permutation test generates the null distribution by randomly shuffling labels in the data. This test (also known as a randomization test or exact test) is a non-parametric method for hypothesis testing that relies on the combinatorial structure of the sample space rather than on asymptotic or distributional assumptions. Let $X = (X_1, \ldots, X_n)$ denote a sample from a population, and suppose we want to test a null hypothesis H0 that implies the joint distribution of X is invariant under a certain group of permutations $\mathcal{G} \subseteq S_n$, where $S_n$ is the group of all n! permutations of {1, 2, …, n}. Formally, H0: $F(x_1, \ldots, x_n) = F(x_{\pi(1)}, \ldots, x_{\pi(n)})$ for all $\pi \in \mathcal{G}$. Under H0, every rearrangement of the observed data induced by elements of $\mathcal{G}$ is equally likely.
Let T(X) be a test statistic, typically chosen such that large (or small) values provide evidence against H0. The permutation distribution of T under H0 is given by ([28,29]):
$$P_{H_0}(T(X) \ge t) = \frac{1}{|\mathcal{G}|} \sum_{\pi \in \mathcal{G}} I\left(T(X_\pi) \ge t\right),$$
where $X_\pi = (X_{\pi(1)}, \ldots, X_{\pi(n)})$ and I(·) is the indicator function. The p-value for the observed statistic $T_{obs}$ is then computed as
$$p = \frac{1}{|\mathcal{G}|} \sum_{\pi \in \mathcal{G}} I\left(T(X_\pi) \ge T_{obs}\right).$$
This p-value represents the proportion of all possible rearrangements of the data that produce a test statistic at least as extreme as the observed one. Intuitively, it quantifies how unusual the observed result is under the null hypothesis, without relying on assumptions like normality. In practice, when |𝒢| is large, a Monte Carlo approximation is used by randomly sampling a subset of permutations.
Permutation tests have the desirable property of being exact, i.e., the significance level is controlled exactly under H0, without requiring assumptions such as normality or equal variances. Furthermore, they are invariant under transformations that preserve the test statistic and are closely connected to the concept of exchangeability.
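The Monte Carlo version mentioned above can be sketched as follows, using the absolute difference in means as T (an illustrative choice; any statistic works):

```python
import random

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Monte Carlo permutation test of H0: F_X = F_Y.

    Group labels are reshuffled WITHOUT replacement; the p-value is
    the proportion of permutations with T at least as extreme as the
    observed value. The +1 terms keep the Monte Carlo p-value
    strictly positive, a standard finite-sample adjustment.
    """
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n1 = len(x)
    t_obs = abs(sum(x) / len(x) - sum(y) / len(y))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        xs, ys = pooled[:n1], pooled[n1:]
        t = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
        if t >= t_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Identical groups give p = 1, while clearly separated groups give a p-value near the exact combinatorial minimum.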

5.3.2. Bootstrap Method

The bootstrap method involves sampling with replacement to create “new” datasets. It is used to estimate standard errors, confidence intervals, and p-values.
Here we present the basic idea of the bootstrap method. Let $X = (X_1, \ldots, X_n)$ be a random sample. Based on this sample we generate a large number of bootstrap samples of size n, denoted by $X^{*1}, X^{*2}, \ldots, X^{*B}$. These samples are drawn with replacement from the original sample $X = (X_1, \ldots, X_n)$. The bootstrap distribution of a statistic is the distribution of the values that the statistic takes across the resamples.
We conclude this subsection with a bootstrap test. In the two-sample problem, we have realized samples $z = (z_1, z_2, \ldots, z_n)$ and $y = (y_1, y_2, \ldots, y_m)$ from possibly different probability distributions F and G. The null hypothesis to be tested is $H_0: F = G$. Let x be the combined sample consisting of all n + m observations. The test statistic can be of the form
$$t(x) = \frac{\bar{z} - \bar{y}}{\sqrt{\operatorname{Var}(z)/n + \operatorname{Var}(y)/m}}.$$
The bootstrap algorithm for calculating the p-value is of the form ([28]):
  • Draw B samples of size n + m with replacement from x. Denote the first n observations of each sample by $z^*$ and the remaining m observations by $y^*$.
  • Evaluate t(·) on each bootstrap sample:
$$t(x^{*b}) = \frac{\bar{z}^* - \bar{y}^*}{\sqrt{\operatorname{Var}(z^*)/n + \operatorname{Var}(y^*)/m}}, \quad b = 1, 2, \ldots, B.$$
    Here, t is calculated for each bootstrap resample. This generates a distribution of t-values under resampling, approximating how the statistic would vary if the experiment were repeated.
  • Approximate the p-value with
$$\#\{t(x^{*b}) \ge t_{obs}\}/B,$$
    where $t_{obs} = t(x)$ is the observed value of the statistic. The bootstrap p-value is the fraction of resampled statistics that are at least as extreme as the observed value. It provides an empirical measure of significance without assuming a specific distribution.
The theoretical justification for resampling relies on the convergence of the empirical permutation or bootstrap distribution to the true sampling distribution of the statistic. Monte Carlo approximations, used when the number of possible permutations is large, provide consistent estimates of p-values and preserve the validity of hypothesis tests under general conditions.
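The two-sample bootstrap test described in this subsection can be sketched directly (degenerate resamples with zero pooled variance are mapped to t = 0, a pragmatic guard for small samples):

```python
import math
import random

def t_stat(z, y):
    """Studentized statistic t = (zbar - ybar)/sqrt(Var(z)/n + Var(y)/m)."""
    n, m = len(z), len(y)
    zbar, ybar = sum(z) / n, sum(y) / m
    var_z = sum((v - zbar) ** 2 for v in z) / (n - 1)
    var_y = sum((v - ybar) ** 2 for v in y) / (m - 1)
    denom = math.sqrt(var_z / n + var_y / m)
    return (zbar - ybar) / denom if denom > 0 else 0.0

def bootstrap_pvalue(z, y, B=2000, seed=0):
    """Bootstrap test of H0: F = G by resampling the pooled sample.

    Each bootstrap sample of size n + m is drawn WITH replacement
    from the combined data; p = #{ t(x*b) >= t_obs } / B.
    """
    rng = random.Random(seed)
    pooled = list(z) + list(y)
    n, m = len(z), len(y)
    t_obs = t_stat(z, y)
    hits = 0
    for _ in range(B):
        resample = [rng.choice(pooled) for _ in range(n + m)]
        if t_stat(resample[:n], resample[n:]) >= t_obs:
            hits += 1
    return hits / B
```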

6. Hypothesis Testing in High-Dimensional Data

In fields like finance and machine learning, researchers often test thousands of hypotheses simultaneously, leading to an increased risk of Type I errors. In high-dimensional settings, we deal with data where the number of variables p is large—often larger than the number of observations. Typically, we conduct multiple simultaneous hypothesis tests, one for each variable:
$$H_{0j}: \mu_j = 0 \quad \text{vs.} \quad H_{1j}: \mu_j \ne 0, \quad j = 1, \ldots, p.$$
In high-dimensional settings, testing many hypotheses inflates the risk of false positives. Two main strategies are used to control this:
1. Bonferroni correction
We divide the overall significance level by the number of tests:
$$\alpha_j = \frac{\alpha}{p}.$$
Although simple and universally valid, the Bonferroni method is highly conservative for large p, leading to low power—especially when the tests are weakly dependent or sparse signals exist.
2. Benjamini–Hochberg procedure
Benjamini and Hochberg [30] proposed controlling the expected proportion of false discoveries among the rejected hypotheses—the False Discovery Rate (FDR). Sort the p-values $p_{(1)} \le \cdots \le p_{(p)}$. Then define:
$$k = \max\left\{ j : p_{(j)} \le \frac{j}{p}\,\alpha \right\}$$
and reject the null hypotheses $H_{(1)}, \ldots, H_{(k)}$. This procedure controls the expected proportion of false positives among the rejected hypotheses.
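Both corrections take only a few lines; the sketch below returns the indices of rejected hypotheses (production implementations are available in `statsmodels.stats.multitest.multipletests`):

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_j when p_j <= alpha / p, with p the number of tests."""
    p = len(pvals)
    return [j for j, pv in enumerate(pvals) if pv <= alpha / p]

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """BH step-up rule: k = max{ j : p_(j) <= (j/p) * alpha };
    reject the hypotheses with the k smallest p-values (FDR control)."""
    p = len(pvals)
    order = sorted(range(p), key=lambda j: pvals[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank / p * alpha:
            k = rank
    return sorted(order[:k])
```

On the same p-values, BH typically rejects more hypotheses than Bonferroni, reflecting its weaker (FDR rather than familywise) error guarantee.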

6.1. High-Dimensional Mean Testing

Assume that $X_{ij} \sim N_p(\mu_i, \Sigma)$, for $j = 1, \ldots, N_i$, $i = 1, 2$. We want to test:
$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \ne \mu_2.$$
Hotelling’s $T^2$ test statistic is
$$T^2 = \eta\, (\bar{X}_1 - \bar{X}_2)' A^{-1} (\bar{X}_1 - \bar{X}_2),$$
where $\bar{X}_1, \bar{X}_2$ are the sample means, $A = \sum_i \sum_j (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)'$, $\eta = \frac{n N_1 N_2}{N_1 + N_2}$, and $n = N_1 + N_2 - 2$. This test is not feasible when p > n. The tests of Bai and Saranadasa [31] and Chen and Qin [32] use simplified statistics like:
$$T_n = (\bar{X}_1 - \bar{X}_2)'(\bar{X}_1 - \bar{X}_2) - \tau \cdot \operatorname{tr}(S_n),$$
where $S_n = A/n$ and τ is a bias-correction constant. Intuitively, $T_n$ measures the squared distance between the two sample means, adjusted by a bias-correction term $\tau \cdot \operatorname{tr}(S_n)$ that accounts for variability in high-dimensional settings. Large values of $T_n$ suggest a significant difference between the population means.
Under H0, with suitable assumptions, the statistic $T_n$ follows an asymptotic normal distribution. Later, Chen and Qin [32] improved this framework by showing optimal power properties even when the covariance matrix exhibits complex dependence structures (see also [33]). Building on these classical results, recent theoretical advancements have refined high-dimensional mean testing by introducing methods that account for sparsity, complex dependence structures, and improved resampling-based approximations. These developments provide deeper engagement with contemporary mathematical results and highlight ongoing progress in the theory of p ≫ n inference.

6.2. Higher Criticism (HC) Test

Donoho and Jin [34] introduced the Higher Criticism (HC) statistic to detect sparse alternatives—settings where only a small proportion of the hypotheses are false. Sort the p-values from the individual tests, $p_{(1)} \le \cdots \le p_{(p)}$. Then define the HC statistic:
$$HC = \max_{1 \le i \le p} \frac{\sqrt{p}\left( i/p - p_{(i)} \right)}{\sqrt{p_{(i)}\left(1 - p_{(i)}\right)}}.$$
The HC statistic detects rare signals by comparing observed p-values to their expected uniform distribution. Large values indicate a small number of hypotheses are unusually significant. The HC test magnifies small, systematic deviations between observed p-values and their expected uniform distribution under H0. It is asymptotically powerful in detecting extremely weak and rare signals—a regime where traditional tests fail. Mathematically, the HC framework connects to empirical process theory and can be interpreted as a goodness-of-fit test for uniformity of p-values.
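The definition transcribes directly into code; indices with $p_{(i)} \in \{0, 1\}$ are skipped here to avoid a zero denominator (a pragmatic guard, not part of the original definition):

```python
import math

def higher_criticism(pvals):
    """HC = max_i sqrt(p) * (i/p - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p is the number of tests and p_(1) <= ... <= p_(p) are the
    sorted p-values."""
    p = len(pvals)
    hc = float("-inf")
    for i, pi in enumerate(sorted(pvals), start=1):
        if 0.0 < pi < 1.0:
            hc = max(hc, math.sqrt(p) * (i / p - pi) / math.sqrt(pi * (1 - pi)))
    return hc
```

A single very small p-value among otherwise unremarkable ones drives HC far above the value obtained from evenly spread (uniform-looking) p-values.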

7. Hypothesis Testing in Machine Learning

Statistical hypothesis testing forms the theoretical backbone of rigorous model evaluation and experimental design in machine learning (ML), data science, and online experimentation. While traditional hypothesis testing concerns population parameters, in ML the focus shifts toward evaluating model performance, feature relevance, and treatment effects under uncertainty.

7.1. A/B Testing and Online Experiments

A/B testing is the cornerstone of empirical validation in industry and online systems. The goal is to compare two versions of a product—control (A) and treatment (B)—to test whether a modification produces a statistically significant improvement in a key performance metric, such as conversion rate or user engagement. Let $X_A$ and $X_B$ denote performance metrics from the two groups, with sample means $\bar{X}_A$ and $\bar{X}_B$ and variances $s_A^2$ and $s_B^2$. The hypotheses are:
$$H_0: \mu_A = \mu_B \quad \text{vs.} \quad H_1: \mu_A \ne \mu_B.$$
The classical two-sample t-statistic is:
$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}.$$
When sample sizes are large, the test statistic approximately follows the standard normal distribution under H0. In web-scale experiments, sequential testing or Bayesian alternatives such as Thompson sampling are often used to control false discoveries while enabling adaptive experimentation.
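A minimal sketch of the two-sided test follows; the normal approximation replaces the exact t distribution, which is reasonable for the large samples typical of web experiments:

```python
import math

def ab_ttest(xa, xb):
    """Two-sample t-statistic with a large-sample normal p-value.

    Returns (t, p) for H0: mu_A = mu_B against a two-sided
    alternative; Phi is the standard normal CDF via math.erf.
    """
    na, nb = len(xa), len(xb)
    ma, mb = sum(xa) / na, sum(xb) / nb
    va = sum((v - ma) ** 2 for v in xa) / (na - 1)
    vb = sum((v - mb) ** 2 for v in xb) / (nb - 1)
    t = (ma - mb) / math.sqrt(va / na + vb / nb)
    phi = 0.5 * (1 + math.erf(abs(t) / math.sqrt(2)))
    return t, 2 * (1 - phi)
```

Swapping the two groups flips the sign of t but leaves the two-sided p-value unchanged.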

7.2. Cross-Validation Significance Testing

Model comparison in machine learning often relies on cross-validation (CV), where data are partitioned into folds for repeated training and testing. Observed performance differences may arise from random data partitioning and training variability; therefore, careful statistical analysis is required to assess whether such differences reflect genuine model improvements.
Suppose two models $M_1$ and $M_2$ yield performance metrics across K cross-validation folds, and let $d_i = \operatorname{score}(M_1)_i - \operatorname{score}(M_2)_i$. The null hypothesis is:
$$H_0: E[d_i] = 0.$$
Since CV fold results are generally dependent, we use corrected resampled t-tests (see [35]), Bayesian methods or permutation-based tests to obtain more reliable inference on whether observed performance differences reflect genuine model improvement rather than random variation.
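As one illustration, a corrected resampled t-statistic in the spirit of Nadeau and Bengio [35] inflates the usual 1/K variance factor by the test-to-train size ratio to compensate for the overlap between training sets. The sketch below follows the commonly cited form of the correction; the exact constants should be checked against [35]:

```python
import math

def corrected_resampled_t(d, test_train_ratio):
    """Corrected resampled t-statistic for K dependent score differences d.

    The plain t-test variance factor 1/K is replaced by
    (1/K + n_test/n_train), acknowledging that fold results are
    positively correlated through shared training data.
    """
    K = len(d)
    mean = sum(d) / K
    var = sum((v - mean) ** 2 for v in d) / (K - 1)
    return mean / math.sqrt((1 / K + test_train_ratio) * var)
```

Because the variance is inflated, the corrected statistic is always smaller in magnitude than the naive resampled t, making the comparison more conservative.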

7.3. Permutation and Bootstrap-Based Feature Significance

Permutation and bootstrap approaches offer model-agnostic methods for significance testing of feature importance. Given a fitted model f with predictions $\hat{y}_i = f(x_i)$, the null hypothesis for a feature $X_j$ is:
$$H_0: X_j \text{ is not associated with } Y.$$
This hypothesis tests whether feature $X_j$ has any predictive power. Repeated permutation or resampling assesses how much the model’s performance changes if $X_j$ is shuffled. We repeatedly permute $X_j$ (B times) while keeping the other features fixed and recompute the performance metric. The permutation-based p-value is then calculated as in (63).
Permutation tests provide a robust, distribution-free way to assess variable relevance. Such permutation-based inference provides a robust alternative to parametric tests, particularly in nonlinear models such as random forests or neural networks (see [36]). Bootstrap resampling complements this by quantifying uncertainty in feature importance, particularly useful in ensemble and deep learning models.
From a theoretical perspective, permutation-based feature significance and bootstrap resampling provide rigorous control over type I error and quantify uncertainty in model outputs. Corrected resampled t-tests and related methods allow inference on dependent cross-validation results, ensuring that observed performance differences reflect true signal rather than random variation.
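A generic sketch of this scheme follows. Here `model_score` is a placeholder for any user-supplied metric (higher is better) evaluated on the data, and `toy_score` is a deliberately simple stand-in for a fitted model, included only for illustration:

```python
import random

def perm_feature_pvalue(model_score, X, y, j, B=200, seed=0):
    """Permutation-based p-value for H0: feature j has no predictive power.

    Column j is shuffled B times while the other features stay fixed;
    the p-value is the fraction of permutations whose score matches or
    beats the original (with a +1 finite-sample adjustment).
    """
    rng = random.Random(seed)
    base = model_score(X, y)
    col = [row[j] for row in X]
    hits = 0
    for _ in range(B):
        perm = col[:]
        rng.shuffle(perm)
        Xp = [row[:j] + [perm[i]] + row[j + 1:] for i, row in enumerate(X)]
        if model_score(Xp, y) >= base:
            hits += 1
    return (hits + 1) / (B + 1)

def toy_score(X, y):
    """Toy 'model': predict y with feature 0; score = negative SSE."""
    return -sum((row[0] - t) ** 2 for row, t in zip(X, y))
```

With `toy_score`, permuting the predictive feature yields a tiny p-value, while permuting an irrelevant feature leaves the score unchanged and yields p = 1.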

7.4. Multiple Testing and False Discovery Control in ML

Modern ML pipelines often involve evaluating thousands of features or hyperparameters, leading to multiple testing problems. To control the false discovery rate, procedures such as those presented in [30] and knockoff methods (see [37]) are used. These methods maintain interpretability while ensuring reliable feature selection in high-dimensional spaces.

7.5. Reproducibility and Scientific Integrity

As ML research becomes increasingly empirical, hypothesis testing ensures that improvements are statistically significant and replicable. Recent efforts emphasize combining classical inference with Bayesian and information-theoretic approaches, offering a unified framework for uncertainty quantification in AI systems.

8. Applications Across Scientific Fields

Statistical hypothesis testing has widespread applications in nearly every empirical discipline. While the core principles remain the same, each field adapts hypothesis testing to address domain-specific challenges and data types. This section illustrates how hypothesis testing is applied in practice across medicine, economics, psychology, machine learning, and industry experimentation.

8.1. Medicine and Clinical Trials

In medical research, hypothesis testing is central to the design and evaluation of randomized controlled trials.
Example: If we want to test whether a new drug is more effective than a placebo, then:
H0: 
There is no difference in recovery rates.
H1: 
The new drug improves recovery rates.
Tests typically involve:
  • Two-sample t-tests (for continuous outcomes)
  • Chi-square tests (for categorical outcomes like event rates)
  • Log-rank tests (for survival analysis)
Regulatory agencies often require significance levels of α = 0.05 and pre-specified hypotheses for trial approval.

8.2. Economics and Policy Evaluation

Econometrics often uses hypothesis testing to validate theoretical models and evaluate the impact of public policies.
Example: If we want to test the effect of a tax policy on employment rates, then we can:
  • Use regression-based tests (e.g., t-tests on regression coefficients)
  • May involve instrumental variables or difference-in-differences approaches
Hypothesis testing here is used to separate causality from correlation.

8.3. Psychology

In psychology, hypothesis development and testing is a structured method for investigating human behavior. It starts with observations that raise questions, such as whether caffeine improves alertness. Researchers formulate multiple hypotheses, generate predictions from them, and then collect data to determine which hypothesis best matches the observed results. Predictions must logically follow from the hypotheses and be testable.
Psychology faces two persistent problems: (1) publication bias, i.e., a preference for statistically significant results, and (2) low statistical power due to small sample sizes.
Statistical reform in psychology is a driving force behind more rigorous application of hypothesis testing.

8.4. Machine Learning and Data Science

Although machine learning often focuses on prediction rather than inference, hypothesis testing is increasingly used to evaluate models, compare algorithms, and understand features.
Applications: A/B testing, permutation tests, cross-validation comparisons
Hypothesis testing bridges the gap between black-box model outputs and actionable insights.

8.5. Industry and Business Analytics

In tech companies and digital marketing, hypothesis testing underpins experimental design for product development, ad optimization, and user engagement.
Example: If we want to test whether a new button color increases click-through rate, then we can:
H0: 
No difference between old and new versions.
H1: 
The new version is better than the old one.
Tools typically involve:
  • z-test for proportions, confidence intervals, online platforms, sequential testing, Bayesian monitoring.

9. Challenges, Misconceptions, and Modern Reform Movements

While statistical hypothesis testing remains foundational in empirical research, its widespread use has led to numerous misunderstandings, misuses, and systemic problems in scientific publishing. In recent years, these issues have sparked reform efforts across disciplines aiming to improve the reliability, transparency, and reproducibility of scientific findings.

9.1. Misinterpretation and Misuse of p-Values

The formal definition of the p-value is:
$$p = P(\text{test statistic is at least as extreme as the realized value} \mid H_0 \text{ is true}).$$
Let T be the test statistic and let $T_{obs}$ be its realized value. Then:
$$p = P(T \ge T_{obs} \mid H_0) \quad \text{for a right-tailed test},$$
$$p = P(T \le T_{obs} \mid H_0) \quad \text{for a left-tailed test},$$
$$p = 2 \min\{ P(T \ge T_{obs} \mid H_0),\; P(T \le T_{obs} \mid H_0) \} \quad \text{for a two-tailed test}.$$
The p-value is frequently misunderstood:
  • A p-value does not indicate the probability that the null hypothesis is true.
  • A p-value does not measure the size or importance of an effect.
  • A small p-value does not guarantee that a result is practically important or replicable.
Misinterpretations have led to widespread misuse and “p-hacking,” where researchers consciously or unconsciously manipulate data, models, or analysis choices to achieve statistical significance (typically p < 0.05).

9.2. P-Hacking and Publication Bias

P-hacking refers to any strategy used to obtain statistically significant results, including:
  • Selective reporting of variables
  • Optional stopping (collecting more data until significance is reached)
  • Multiple testing without correction
  • Excluding “outliers” post hoc
Publication bias further amplifies this by favoring significant results over null findings, creating a distorted scientific record.
Figure 6 illustrates p-hacking strategies, showing the frequency of p-values across repeated analyses and how selective reporting can inflate the occurrence of apparently significant p-values near 0.05.

9.3. The Replication Crisis

A large-scale effort by the Open Science Collaboration [38] attempted to replicate 100 studies from high-impact psychology journals. Key findings:
  • Replication rate was 25% for social psychology
  • Replication rate was 50% for cognitive psychology
  • The average effect size was about half the original (note that effect size measures express the practical or clinical significance of differences among sample means, as contrasted with the statistical significance of the differences)
  • The reproducibility rate varied by field, method, and study quality
This crisis is not unique to psychology—it extends to economics, medicine, and even machine learning research.

9.4. ASA Statement and the “Beyond p < 0.05” Movement

In 2016, the American Statistical Association (ASA) issued a historic statement (see [39]) warning against overreliance on p-values: “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”
The ASA encouraged researchers to:
  • Focus on effect sizes, confidence intervals, and context
  • Avoid binary “significant/non-significant” thinking
  • Embrace transparency, pre-registration, and open science
This gave rise to the “Beyond p < 0.05” movement (see [40]), which advocates for a broader and more nuanced inferential framework. Key recommendations include:
  • Estimation-based inference
  • Bayesian methods
  • Decision-theoretic approaches
  • Registered reports and reproducible workflows

9.5. Recommended Reforms and Best Practices

To address the well-documented limitations of null hypothesis significance testing and the overreliance on p-values, several methodological reforms have been proposed and increasingly adopted across the social, behavioral, and biomedical sciences. These reforms are presented in Table 3 and aim to improve transparency, reproducibility, and the interpretability of statistical evidence (see [41]).

10. Conclusions and Future Directions

Statistical hypothesis testing remains one of the central methods in inferential statistics and in many empirical sciences. From the early ideas of Fisher, through the work of Neyman and Pearson, to modern approaches that include Bayesian statistics and methods for high-dimensional data, the concept of hypothesis testing has gradually developed into a framework that allows researchers to quantify uncertainty and make data-based conclusions. The classical approach—relying on null and alternative hypotheses, p-values, and error control—still forms the basis for interpreting results in fields such as medicine, economics, psychology, and machine learning.
However, the increasing complexity of data and the growth of large-scale studies have exposed limitations of traditional methods. Overreliance on fixed significance thresholds, widespread misuse of p-values, and challenges in reproducibility have prompted a methodological and philosophical re-examination of hypothesis testing. Modern research emphasizes effect sizes, confidence intervals, Bayesian evidence, and the integration of prior information, promoting a shift from binary decision-making toward a more nuanced understanding of statistical evidence.
Future research in hypothesis testing is expected to proceed along several key directions.
  • High-dimensional and complex data inference: The rise of big data and high-dimensional statistics calls for new test statistics that remain powerful and computationally efficient, where traditional asymptotics fail. Techniques such as random matrix theory, and resampling-based inference will continue to expand.
  • Integration with machine learning: As models become increasingly nonlinear and complex, there is a need for frameworks that allow hypothesis testing in “black-box” settings. Challenges include testing feature importance, model interpretability, and validation of deep learning results.
  • Reproducibility and transparency: The replication crisis across multiple disciplines has strengthened the call for more transparent workflows. Pre-registration of research plans, open data, and clearly defined analysis procedures are becoming increasingly standard practice.
  • Bayesian and decision-theoretic frameworks: Bayesian hypothesis testing offers a coherent probabilistic interpretation of evidence, enabling sequential analysis, adaptive experimentation, and model comparison. Expanding these approaches for large-scale and hierarchical models represents an important avenue for development.
  • Ethical and interpretative dimensions: As hypothesis testing continues to influence policy, medicine, and AI systems, future discussions must address the epistemological and ethical implications of statistical inference—ensuring that statistical significance aligns with practical significance and social impact.
In conclusion, while statistical hypothesis testing has undergone a century of refinement, its evolution is far from complete. The field stands at the intersection of classical rigor and computational innovation, striving for methods that are not only mathematically sound but also transparent, reproducible, and contextually meaningful. The future of hypothesis testing lies in unifying theoretical advances with responsible data practice—building a statistically literate and trustworthy foundation for the sciences of tomorrow.

Funding

This research was funded by Ministry of Science, Technological Development and Innovations of the Republic of Serbia.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Fisher, R.A. Statistical Methods for Research Workers; Oliver and Boyd: Edinburgh, UK, 1925. [Google Scholar]
  2. Neyman, J.; Pearson, E.S. On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference: Part I. Biometrika 1928, 20, 175–240. [Google Scholar]
  3. Arbuthnot, J. An argument for divine providence, taken from the constant regularity observ’d in the births of both sexes. Philos. Trans. 1710, 27, 186–190. [Google Scholar]
  4. Michell, J. An inquiry into the probable parallax, and magnitude of the fixed stars, from the quantity of light which they afford us, and the particular circumstances of their situation. Philos. Trans. 1767, 57, 234–264. [Google Scholar]
  5. Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1900, 50, 157. [Google Scholar] [CrossRef]
  6. Gosset, W. The probable Error of a Mean. Biometrika 1908, 6, 1–25. [Google Scholar] [CrossRef]
  7. Neyman, J.; Pearson, E.S. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika 1928, 20, 263–294. [Google Scholar] [CrossRef]
  8. Lehmann, E.L. The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? J. Am. Stat. Assoc. 1993, 88, 1242–1249. [Google Scholar] [CrossRef]
  9. Lehmann, E.L. Fisher, Neyman and the Creation of Classical Statistics; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  10. Casella, G.; Berger, R.L. Statistical Inference; Thomson Learning Academic Resource Center: Pacific Grove, CA, USA, 2002. [Google Scholar]
  11. Greene, W.H. Econometric Analysis; Prentice Hall: Hoboken, NJ, USA, 2018. [Google Scholar]
  12. Levene, H. Robust tests for the analysis of variance, 278–292. In Contributions to Probability and Statistics; Olkin, I., Ed.; Stanford University Press: Redwood City, CA, USA, 1960. [Google Scholar]
  13. Wooldridge, J.M. Introductory Econometrics; Cengage: Boston, MA, USA, 2020. [Google Scholar]
  14. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis; Wiley: Hoboken, NJ, USA, 2012. [Google Scholar]
  15. Rao, C.R. Linear Statistical Inference and Its Applications, 2nd ed.; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
  16. Fisher, R.A. Design of Experiments; Oliver and Boyd: Edinburgh, UK, 1935. [Google Scholar]
  17. Fisher, R. Statistical Methods and Scientific Induction. J. R. Stat. Soc. 1955, 17, 69–78. [Google Scholar] [CrossRef]
  18. Neyman, J.; Pearson, E.S. Joint Statistical Papers of J. Neyman and E.S. Pearson; Cambridge University Press: Cambridge, UK, 1967. [Google Scholar]
  19. Hubbard, R. Alphabet soup. Blurring the distinction between p’s and a’s in psychological research. Theory Psychol. 2004, 14, 295–327. [Google Scholar] [CrossRef]
  20. Pishro-Nik, H. Introduction to Probability, Statistics and Random Processes. Kappa Research, L.L.C. 2014. Available online: https://www.probabilitycourse.com/chapter9/9_1_8_bayesian_hypothesis_testing.php (accessed on 20 October 2025).
  21. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; Chapman & Hall: Boca Raton, FL, USA, 2013. [Google Scholar]
  22. Schoot, R.; Depaoli, S.; King, R.; Kramer, B.; Märtens, K.; Tadesse, M.G.; Vannucci, M.; Gelman, A.; Veen, D.; Willemsen, J.; et al. Bayesian statistics and modelling. Nat. Rev. Methods Primers 2021, 1, 1. [Google Scholar] [CrossRef]
  23. Petrovic, L.J. Teorijska Statistika; Publishing Centre Faculty of Economics: Belgrade, Serbia, 2018. [Google Scholar]
  24. Mann, H.B.; Whitney, D.R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 1947, 18, 50–60. [Google Scholar] [CrossRef]
  25. Wilcoxon, F. Individual comparisons by ranking methods. Biom. Bull. 1945, 1, 80–83. [Google Scholar] [CrossRef]
  26. Kruskal, W.H.; Wallis, W.A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 1952, 47, 583–621. [Google Scholar] [CrossRef]
  27. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  28. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; Chapman & Hall: Boca Raton, FL, USA, 1993. [Google Scholar]
  29. Altmann, A.; Tolosi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347. [Google Scholar] [CrossRef] [PubMed]
  30. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. 1995, 57, 289–300. [Google Scholar] [CrossRef]
  31. Bai, Z.; Saranadasa, H. Effect of high dimensions: By an example of a two sample problem. Stat. Sin. 1996, 6, 311–329. [Google Scholar]
  32. Chen, S.X.; Qin, Y.L. A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Stat. 2010, 38, 808–835. [Google Scholar] [CrossRef]
  33. Cai, T.; Liu, W.D. Adaptive thresholding for sparse covariance matrix estimation. J. Am. Stat. Assoc. 2011, 106, 672–684. [Google Scholar] [CrossRef]
  34. Donoho, D.; Jin, J. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 2004, 32, 962–994. [Google Scholar] [CrossRef]
  35. Nadeau, C.; Bengio, Y. Inference for the Generalization Error. Mach. Learn. 2003, 52, 239–281. [Google Scholar] [CrossRef]
  36. Strobl, C.; Boulesteix, A.L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307. [Google Scholar] [CrossRef] [PubMed]
  37. Barber, R.F.; Candès, E.J. Controlling the false discovery rate via knockoffs. Ann. Stat. 2015, 43, 2055–2085. [Google Scholar] [CrossRef]
  38. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 2015, 349, 6251. [Google Scholar] [CrossRef] [PubMed]
  39. Wasserstein, R.L.; Lazar, N.A. The ASA statement on p-values: Context, process, and purpose. Am. Stat. 2016, 70, 129–133. [Google Scholar] [CrossRef]
  40. Wasserstein, R.L.; Allen, L.S.; Lazar, N.A. Moving to a World Beyond “p < 0.05”. Am. Stat. 2019, 73, 1–19. [Google Scholar]
  41. Lakens, D.; Mesquida, C.; Rasti, S.; Ditroilo, M. The benefits of preregistration and Registered Reports. Evid.-Based Toxicol. 2024, 2, 2376046. [Google Scholar] [CrossRef]
Figure 1. Rejection region for a two-tailed z test, showing where the null hypothesis H0 would be rejected.
Figure 2. Rejection region for a two-tailed t test, showing where the null hypothesis H0 would be rejected.
Figure 3. Rejection region for a two-tailed χ² test, showing where the null hypothesis H0 would be rejected.
Figure 4. Rejection region for a one-tailed F test, showing where the null hypothesis H0 would be rejected.
Figure 5. Bayesian Hypothesis Testing Visualization.
Figure 6. Illustration of p-hacking strategies.
Table 1. Two types of errors.

Decision \ Condition | H0 True | H0 False
Reject H0 | Type I error (α) | Correct decision
Fail to reject H0 | Correct decision | Type II error (β)

Source: Literature review (author’s synthesis).
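The Type I error rate in Table 1 can be illustrated with a short Monte Carlo sketch: simulating data under a true H0, the fraction of (incorrect) rejections should be close to the nominal α. The sample size, replication count, and critical value 2.045 (two-tailed t, df = 29) are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
n, reps = 30, 10_000

# Simulate samples under H0 (true mean 0) and count rejections
rejections = 0
for _ in range(reps):
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    t_stat = x.mean() / (x.std(ddof=1) / np.sqrt(n))  # one-sample t statistic
    if abs(t_stat) > 2.045:  # two-tailed t critical value, df = 29, alpha = 0.05
        rejections += 1

type_i_rate = rejections / reps  # should be close to alpha = 0.05
```

Simulating instead under a specific alternative (e.g., a nonzero mean) and counting non-rejections would estimate the Type II error rate β in the same way.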
Table 2. Neyman–Pearson Framework vs. Fisher Testing.

Aspect | Fisher Approach | Neyman–Pearson Framework
Focus | Evidence against H0 | Decision between H0 and H1
Use of alternative | Not required | Required
Main tools | Test statistic, p-value | Test statistic, critical region
Control of errors | Not emphasized | α and β errors considered

Source: Literature review (author’s synthesis).
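The contrast in Table 2 can be made concrete with a two-sided z test (a minimal sketch; the observed statistic 2.1 is an illustrative value): the Fisher approach reports a p-value as graded evidence, while the Neyman–Pearson framework fixes α in advance and makes a binary decision via the critical region.

```python
from scipy import stats

z_obs = 2.1   # observed z statistic (illustrative value)
alpha = 0.05

# Fisher-style: two-sided p-value as a measure of evidence against H0
p_value = 2 * (1 - stats.norm.cdf(abs(z_obs)))

# Neyman–Pearson-style: compare against the pre-specified critical region
z_crit = stats.norm.ppf(1 - alpha / 2)
decision = "reject H0" if abs(z_obs) > z_crit else "fail to reject H0"
```

Here both routes lead to rejection, but only the Fisher-style p-value (about 0.036) conveys how close the result is to the threshold.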
Table 3. Recommended Reforms.

Reform | Description
Pre-registration | Hypotheses and analysis plans are specified before data collection
Registered reports | Peer review occurs prior to data collection
Reporting effect sizes | Report standardized and unstandardized effect sizes
Confidence intervals | Present estimates with uncertainty
Bayesian analysis | Quantifies evidence and updates beliefs
Open data and code | Enhances reproducibility and secondary analysis

Source: Literature review (author’s synthesis).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Rajić, V. Statistical Hypothesis Testing: A Comprehensive Review of Theory, Methods, and Applications. Mathematics 2026, 14, 300. https://doi.org/10.3390/math14020300
