1. Introduction
Statistical hypothesis testing plays a fundamental role in modern scientific research, serving as a central tool for drawing inferences from data. First introduced in the early 20th century by pioneers such as Ronald A. Fisher [1] and Jerzy Neyman and Egon Pearson [2], hypothesis testing has since evolved into a critical component of inferential statistics. It enables researchers to formally assess whether observed data are consistent with a proposed model or whether deviations are statistically significant.
Statistical hypothesis testing has long been a cornerstone of empirical research, providing a structured framework to quantify uncertainty and draw conclusions from data. Classical approaches, such as Fisher’s significance testing and the Neyman–Pearson framework, laid the foundation for modern inferential statistics. However, with the advent of high-dimensional data, machine learning applications, and large-scale experimental studies, traditional methods face new challenges, including inflated false discovery rates, non-standard test statistics, and complex dependency structures.
This article aims to provide a comprehensive and integrative perspective on hypothesis testing by:
- Synthesizing classical and modern approaches, including Fisherian, Neyman–Pearson, Bayesian, resampling, and high-dimensional methods.
- Offering a critical comparison of methodologies, highlighting their theoretical strengths, limitations, and practical applicability in contemporary data-rich environments.
- Bridging theory and application, with detailed discussions on high-dimensional mean testing, permutation and bootstrap techniques, and model-agnostic inference in machine learning.
- Identifying open research questions and unresolved issues, particularly in the context of reproducibility, p-value controversies, and statistical inference in complex models.
Unlike traditional textbooks or standard survey articles, this work emphasizes critical synthesis, comparative evaluation, and integration of recent theoretical results, providing a roadmap for both researchers and practitioners navigating classical and emerging methodologies in statistical hypothesis testing.
The classical approach to hypothesis testing typically involves formulating a null hypothesis (H0), representing a default assumption or status quo, and an alternative hypothesis (H1), representing the presence of an effect or difference. By comparing test statistics derived from sample data to theoretical distributions, researchers can compute p-values and make decisions regarding the rejection or non-rejection of H0.
Despite its widespread adoption across disciplines such as medicine, economics, psychology, and engineering, hypothesis testing is not without criticism. Concerns related to the misuse of p-values, the replication crisis, and statistical power have prompted statisticians and scientists to re-examine foundational principles and propose alternative methods, including Bayesian approaches and robust nonparametric techniques.
This review article aims to provide a comprehensive and structured overview of hypothesis testing, including its theoretical foundations, commonly used methods, and practical applications. We also address current debates and emerging trends, particularly in high-dimensional data settings and interdisciplinary research. By synthesizing classical and modern perspectives, this paper intends to serve as a resource for both newcomers and experienced researchers seeking a deeper understanding of hypothesis testing in statistical inference.
2. History of Hypothesis Testing
Before hypothesis testing became formalized as a statistical method, researchers relied on intuition, experience, and methods that were more focused on descriptive statistics. At that time, statistics was mainly a tool for collecting and organizing data, without rigorous methods for testing theories and hypotheses.
The conceptual roots of hypothesis testing can be traced to early empirical procedures such as the Trial of the Pyx (judicial ceremony) in medieval England. In this annual examination, coins produced by the Royal Mint were randomly sampled and compared with the expected weight of precious metals. If significant discrepancies were observed, penalties followed. This procedure embodied the core logic of modern hypothesis testing: establishing a hypothesis, drawing a random sample, comparing observed outcomes with theoretical expectations, and making inferences based on deviations from the null assumption. The logical foundation of testing hypotheses, often viewed as a proof by contradiction, has deep philosophical origins.
Experiments attributed to Galileo challenged Aristotle’s hypothesis on falling bodies by providing empirical evidence inconsistent with theoretical expectations. This reasoning—attempting to falsify rather than prove a statement—anticipates the structure of modern significance testing.
Early statistical applications appeared in the eighteenth century. John Arbuthnot [3] analyzed birth records in London and argued that the consistent predominance of male births over female ones could not be explained by chance, proposing it as evidence of divine providence. Similarly, John Michell [4] examined the distribution of stars and concluded that their apparent clustering was unlikely under randomness, thereby introducing inferential reasoning in astronomy.
In the 19th century, several scientists applied probabilistic and statistical techniques to data analysis, but formal methods of hypothesis testing did not yet exist.
Karl Pearson [5] developed basic statistical tools such as the chi-square test and correlation. His achievements were an important foundation for the further development of statistics. William Gosset [6] developed the t-test for small samples, providing one of the first exact tests of significance.
One of the most important moments in the history of hypothesis testing was the work of Ronald A. Fisher, an English statistician who pioneered the foundations of modern statistical testing. In the 1920s, Fisher [1] formulated several key principles in statistics:
- Fisher’s probability method: Fisher introduced the concept of probability in the context of hypotheses, laying the foundation for today’s understanding of p-values. He proposed quantifying the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
- Fisher’s tests: Basic statistical tests were developed, such as the F-test for analysis of variance, which enabled formalized testing of hypotheses about population parameters.
- p-value: Fisher popularized the use of the p-value as a tool for deciding whether to reject or fail to reject the null hypothesis. He recommended thresholds such as 0.05, although this threshold is the subject of debate and criticism today.
In 1928, Neyman and Pearson [2,7] developed a theoretical framework for hypothesis testing that laid the foundation for modern statistics. Their most important innovation was the formalization of hypothesis testing with the introduction of new terms:
Null hypothesis and alternative hypothesis: Neyman and Pearson emphasized the importance of using two competing hypotheses—the null hypothesis (H0) which asserts that there is no effect or difference, and the alternative hypothesis (H1) which assumes that there is an effect or difference.
Errors in hypothesis testing: They developed the concept of errors that can occur during hypothesis testing. They recognized two types of errors:
- (1) Type I error (false positive): An error in which the null hypothesis is rejected even though it is true.
- (2) Type II error (false negative): An error in which the null hypothesis is not rejected even though it is false.
Critical region and alpha level: They suggested setting a critical region and a significance threshold (alpha level). If the test statistic falls within the critical region determined by the significance level α, the null hypothesis is rejected.
This framework, known as Neyman–Pearson theory, laid the foundation for modern hypothesis testing techniques (for more details see [8,9]). Following the work of Fisher and Neyman–Pearson, the statistical community began to adopt and develop hypothesis testing across disciplines. Hypothesis testing has become a standard tool in many research fields such as medicine, psychology, economics, engineering, and the social sciences, making it possible to draw more precise conclusions from research and experiments.
3. Null and Alternative Hypotheses
At the core of statistical hypothesis testing lies the formulation of two mutually exclusive statements:
- The null hypothesis H0 typically represents a baseline assumption or status quo, such as “no difference” or “no effect.”
- The alternative hypothesis H1 reflects the presence of an effect, difference, or association that the researcher aims to detect.
Formally, a hypothesis test is a decision procedure that evaluates the plausibility of H0 based on observed data. Depending on the research question, the general format of H0 and H1 is

$$H_0: \theta \in \Theta_0 \quad \text{versus} \quad H_1: \theta \in \Theta_0^c,$$

where $\theta$ is a population parameter, $\Theta_0$ is some subset of the parameter space $\Theta$, and $\Theta_0^c$ is its complement (see [10]). In hypothesis testing, after drawing a sample and computing a test statistic, we decide either to reject H0 or not to reject H0. During the procedure, we determine two regions: the rejection region and the non-rejection region.
3.1. Types of Errors and Statistical Power
A hypothesis test can result in two kinds of errors:
- Type I error (α): Incorrectly rejecting H0 when it is true (false positive). For $\theta \in \Theta_0$, the probability of a Type I error is $P_\theta(\mathbf{X} \in R)$, where $R$ is the rejection region and $\mathbf{X}$ is the sample (see [10]).
- Type II error: Failing to reject H0 when H1 is true (false negative). For $\theta \in \Theta_0^c$, the probability of a Type II error is $P_\theta(\mathbf{X} \in R^c)$, where $R^c$ is the non-rejection region and $\mathbf{X}$ is the sample (see [10]).

The power function of a test is the function $\beta(\theta) = P_\theta(\mathbf{X} \in R)$. The ideal power function is near 0 for $\theta \in \Theta_0$ and near 1 for $\theta \in \Theta_0^c$. The two types of errors are summarized in Table 1.
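To make the power function concrete, the following Python sketch (an illustration added here, not part of the original derivation) computes $\beta(\mu)$ for a one-sided z-test of H0: $\mu \le 0$ with known $\sigma$; the sample size, $\sigma$, and $\alpha$ are assumed values chosen for demonstration.

```python
# Power of a one-sided z-test, H0: mu <= 0 vs H1: mu > 0, with known sigma.
# beta(mu) = P_mu(reject H0) = P(Z > z_alpha - sqrt(n)*mu/sigma).
# n, sigma, and alpha are illustrative assumptions, not values from the text.
from scipy.stats import norm

def power(mu, n=25, sigma=1.0, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha)                    # critical value z_alpha
    return norm.sf(z_alpha - (mu * n**0.5) / sigma)  # P(Z > z_alpha - sqrt(n)mu/sigma)

# Near alpha on the null boundary, increasing toward 1 under the alternative:
for mu in [0.0, 0.2, 0.5, 1.0]:
    print(f"mu = {mu:.1f}: power = {power(mu):.3f}")
```

As expected, the computed power equals $\alpha$ at the boundary of the null and approaches 1 as $\mu$ moves deeper into the alternative.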
3.2. Methods of Finding Tests
A very general method of constructing tests is based on the likelihood ratio. Let $\mathbf{X} = (X_1, \dots, X_n)$ be a random sample from a population with pdf or pmf $f(x \mid \theta)$. The likelihood function is

$$L(\theta \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i \mid \theta).$$

Definition [10]: The likelihood ratio test (LRT) statistic for testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$ is

$$\lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta \mid \mathbf{x})}.$$

A likelihood ratio test is any test that has a rejection region of the form $\{\mathbf{x} : \lambda(\mathbf{x}) \le c\}$, where $c$ satisfies $0 \le c \le 1$ and $\Theta$ denotes the entire parameter space.
4. Classical Statistical Tests
Statistical hypothesis testing includes a variety of classical procedures, each designed for specific types of data and assumptions. This section presents the most commonly used parametric tests. Each test relies on theoretical distributions and assumptions such as normality, independence, and homogeneity of variance.
4.1. Z-Test
The z-test is used to determine whether a population mean significantly differs from some hypothetical value when the population standard deviation σ is known and the sample size is sufficiently large (or assuming a normal distribution). The null and the alternative hypothesis for the two-tailed test are

$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$$

If we apply the LRT, we get the rejection region (see [10])

$$\left\{ \mathbf{x} : |\bar{x} - \mu_0| > z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}} \right\},$$

where $\bar{x}$ is the sample mean and $z_{\alpha/2}$ satisfies $P(Z > z_{\alpha/2}) = \alpha/2$ with $Z \sim N(0,1)$. Figure 1 shows the rejection region.
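As a hedged illustration, the rejection rule above can be carried out numerically; the data-generating values in the following sketch are assumptions, not taken from the text.

```python
# Two-sided z-test for H0: mu = mu0 with known sigma; a minimal sketch
# on simulated data (mu0, sigma, n, and the true mean are assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu0, sigma, n, alpha = 5.0, 2.0, 40, 0.05
x = rng.normal(5.8, sigma, size=n)           # sample with true mean 5.8

z = (x.mean() - mu0) / (sigma / np.sqrt(n))  # test statistic
p_value = 2 * norm.sf(abs(z))                # two-tailed p-value
reject = abs(z) > norm.ppf(1 - alpha / 2)    # rejection region |z| > z_{alpha/2}
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```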
4.2. Student’s t-Test
When σ is unknown, we use the t-test for testing statistical hypotheses about the population mean.
4.2.1. One Sample Problem
In the case of one sample, the t-test adjusts for the increased uncertainty by using the sample standard deviation and a t-distribution with $n - 1$ degrees of freedom. The null and the alternative hypothesis for the two-tailed test are (see [10]):

$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0.$$

If we apply the LRT, we get the rejection region

$$\left\{ \mathbf{x} : |\bar{x} - \mu_0| > t_{n-1,\alpha/2} \, \frac{s}{\sqrt{n}} \right\},$$

where $s^2$ is the sample variance and $t_{n-1,\alpha/2}$ satisfies $P(T > t_{n-1,\alpha/2}) = \alpha/2$ with $T \sim t_{n-1}$. See Figure 2.
4.2.2. Two Samples Problem
Let $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ be two samples from two normal distributions with unknown but equal variances. The null and the alternative hypothesis for the two-tailed test are

$$H_0: \mu_X = \mu_Y \quad \text{versus} \quad H_1: \mu_X \neq \mu_Y.$$

If we apply the LRT, we get the rejection region (see [10])

$$\left\{ (\mathbf{x}, \mathbf{y}) : |\bar{x} - \bar{y}| > t_{n+m-2,\alpha/2} \, s_p \sqrt{\tfrac{1}{n} + \tfrac{1}{m}} \right\},$$

where $s_p^2$ is the pooled variance estimator and $t_{n+m-2,\alpha/2}$ satisfies $P(T > t_{n+m-2,\alpha/2}) = \alpha/2$ with $T \sim t_{n+m-2}$.
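Both t-tests are available in SciPy; the following minimal sketch uses simulated data (all numerical values are illustrative assumptions), with equal_var=True matching the pooled-variance LRT above.

```python
# One- and two-sample t-tests (sigma unknown); a sketch using SciPy with
# simulated data -- sample sizes, means, and variances are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(10.4, 2.0, size=15)
y = rng.normal(9.5, 2.0, size=18)

# One-sample: H0: mu = 10
t1, p1 = stats.ttest_1samp(x, popmean=10.0)

# Two-sample with pooled variance (equal_var=True matches the LRT above)
t2, p2 = stats.ttest_ind(x, y, equal_var=True)
print(f"one-sample: t = {t1:.3f}, p = {p1:.4f}")
print(f"two-sample: t = {t2:.3f}, p = {p2:.4f}")
```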
4.3. Chi-Square (χ²) Test
The $\chi^2$ test is used to determine whether a population variance significantly differs from some hypothetical value (assuming a normal distribution). The null and the alternative hypothesis for the two-tailed test are (see [11]):

$$H_0: \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_1: \sigma^2 \neq \sigma_0^2.$$

The test statistic is of the form

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$

and has a $\chi^2$ distribution with $n - 1$ degrees of freedom. When we apply the LRT, we get the rejection regions $\{\chi^2 \le \chi^2_{n-1,1-\alpha/2}\}$ and $\{\chi^2 \ge \chi^2_{n-1,\alpha/2}\}$, which are shown in Figure 3.
4.4. Analysis of Variance (ANOVA)
One–way ANOVA tests whether three or more group means are significantly different. The model is of the form (see [10]):

$$Y_{ij} = \theta_i + \varepsilon_{ij}, \qquad i = 1, \dots, k, \quad j = 1, \dots, n_i,$$

where $k$ is the number of groups and $n_1, \dots, n_k$ are the sample sizes, with $N = \sum_{i=1}^{k} n_i$. We have the following assumptions:
- (1) Homoscedasticity (equality of variances; Levene’s test [12] is often used to verify),
- (2) errors are i.i.d. with normal distribution and zero covariance,
- (3) $\mathrm{E}(\varepsilon_{ij}) = 0$.

The null and the alternative hypothesis for the two-tailed test are

$$H_0: \theta_1 = \theta_2 = \dots = \theta_k \quad \text{versus} \quad H_1: \theta_i \neq \theta_j \ \text{for some} \ i \neq j.$$

The null and the alternative hypothesis can be written in another way (Theorem 11.2.5, [10]):

$$H_0: \sum_{i=1}^{k} a_i \theta_i = 0 \ \text{for all} \ \mathbf{a} \in V \quad \text{versus} \quad H_1: \sum_{i=1}^{k} a_i \theta_i \neq 0 \ \text{for some} \ \mathbf{a} \in V,$$

where $V = \{\mathbf{a} = (a_1, \dots, a_k) : \sum_{i=1}^{k} a_i = 0\}$ is the set of contrasts. We use the following notation: $\bar{Y}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$, $\bar{Y}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} Y_{ij}$, and $S_p^2 = \frac{1}{N-k}\sum_{i=1}^{k}(n_i - 1)S_i^2$ is the pooled variance estimator.

For a fixed contrast $\mathbf{a} \in V$, the test statistic can be of the form

$$T_{\mathbf{a}} = \frac{\sum_{i=1}^{k} a_i \bar{Y}_{i\cdot}}{\sqrt{S_p^2 \sum_{i=1}^{k} a_i^2 / n_i}}$$

and has Student’s distribution with $N - k$ degrees of freedom. If we apply the LRT, we get the rejection region at level $\alpha$: $\{|T_{\mathbf{a}}| > t_{N-k,\alpha/2}\}$. From the union–intersection methodology it follows that if we could reject for any $\mathbf{a}$, we could reject for the $\mathbf{a}$ that maximizes $T_{\mathbf{a}}$. It means that we reject H0 if ([10]) $\sup_{\mathbf{a} \in V} |T_{\mathbf{a}}|$ is large. Maximizing $T_{\mathbf{a}}$ is equivalent to maximizing $T_{\mathbf{a}}^2$. According to Theorem 11.2.8 [10], we have

$$\sup_{\mathbf{a} \in V} T_{\mathbf{a}}^2 = \frac{\sum_{i=1}^{k} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{S_p^2}.$$

Under the assumptions,

$$F = \frac{\sum_{i=1}^{k} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 / (k-1)}{S_p^2}$$

has an $F$ distribution with $k - 1$ and $N - k$ degrees of freedom. Finally, we get the rejection region

$$\{F > F_{k-1,\,N-k,\,\alpha}\},$$

which is shown in Figure 4.
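The F-test above is available in SciPy as f_oneway; the sketch below (simulated groups with illustrative effect sizes, an assumption of ours) also checks the homoscedasticity assumption with Levene’s test [12].

```python
# One-way ANOVA F-test for equality of k group means; group sizes and
# effects are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(mu, 1.0, size=20) for mu in (0.0, 0.0, 0.5)]

# Levene's test for the homoscedasticity assumption
W, p_lev = stats.levene(*groups)

# F-statistic with k-1 and N-k degrees of freedom
F, p_anova = stats.f_oneway(*groups)
print(f"Levene: W = {W:.3f}, p = {p_lev:.3f}")
print(f"ANOVA:  F = {F:.3f}, p = {p_anova:.4f}")
```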
4.5. Hypothesis Testing in Regression Analysis
In the classical linear regression model we have ([11]):

$$Y = X\beta + \varepsilon,$$

where $Y \in \mathbb{R}^n$ is the vector of dependent observations, $X \in \mathbb{R}^{n \times p}$ is a matrix of independent observations, $\beta \in \mathbb{R}^p$ is the vector of unknown regression coefficients, and the errors $\varepsilon$ are assumed to satisfy:

$$\mathrm{E}(\varepsilon) = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2 I_n, \qquad \varepsilon \sim N(0, \sigma^2 I_n).$$

Under these assumptions, the ordinary least squares (OLS) estimator is given by ([11,13]):

$$\hat{\beta} = (X^\top X)^{-1} X^\top Y,$$

and it holds that

$$\hat{\beta} \sim N\!\left(\beta, \, \sigma^2 (X^\top X)^{-1}\right).$$
4.5.1. Testing Individual Coefficients
To test the null hypothesis concerning a single regression parameter,

$$H_0: \beta_j = 0,$$

against the alternative

$$H_1: \beta_j \neq 0,$$

we use the t-statistic

$$t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_{jj}}},$$

where $v_{jj}$ denotes the $j$-th diagonal element of $(X^\top X)^{-1}$ and

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n - p}.$$

Under H0, $t_j$ follows Student’s t-distribution with $n - p$ degrees of freedom, i.e., $t_j \sim t_{n-p}$. The null hypothesis is rejected at significance level α if $|t_j| > t_{n-p,\alpha/2}$.
4.5.2. Joint Hypothesis Testing
More generally, we may test a linear restriction on parameters of the form

$$H_0: R\beta = r,$$

where $R$ is a $q \times p$ restriction matrix of rank $q$. The test statistic is the F-statistic, given by ([11]):

$$F = \frac{(R\hat{\beta} - r)^\top \left[ R (X^\top X)^{-1} R^\top \right]^{-1} (R\hat{\beta} - r) / q}{\hat{\sigma}^2}.$$

Under the null hypothesis, it holds that $F \sim F_{q,\,n-p}$. A common special case is testing the overall significance of the model:

$$H_0: \beta_2 = \dots = \beta_p = 0,$$

which corresponds to $R = [\,0_{(p-1)\times 1} \mid I_{p-1}\,]$. In this case, the F-statistic reduces to

$$F = \frac{\mathrm{SSR}/(p-1)}{\mathrm{SSE}/(n-p)},$$

where SSR is the regression sum of squares and SSE is the error sum of squares.
4.5.3. Connection to Likelihood Ratio Tests
For Gaussian errors, the t- and F-tests coincide with the likelihood ratio tests (LRTs). The LRT statistic for testing $H_0: R\beta = r$ is (see [13,14,15]):

$$\lambda = \left( \frac{\mathrm{SSE}_U}{\mathrm{SSE}_R} \right)^{n/2},$$

where $\mathrm{SSE}_U$ and $\mathrm{SSE}_R$ denote the unrestricted and restricted error sums of squares, and it can be shown that

$$-2 \log \lambda = n \log\!\left( 1 + \frac{qF}{n-p} \right),$$

which is a monotonic function of $F$; hence both tests yield equivalent decisions under normality.
4.5.4. Extensions
In the presence of heteroscedasticity or autocorrelation, OLS-based tests lose validity. In such cases, robust standard errors or generalized least squares (GLS) are used. For nonlinear models, the Wald, Lagrange Multiplier (Score), and Likelihood Ratio tests provide asymptotically equivalent inference under regularity conditions.
4.6. p-Values and Significance
The p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. If p-value ≤ α, we reject H0, indicating statistically significant evidence against it. A small p-value does not measure the size or importance of an effect. It merely suggests incompatibility between the data and the null hypothesis.
4.6.1. Neyman–Pearson Framework vs. Fisher Testing
Two philosophical approaches have historically shaped hypothesis testing:
- Fisher’s approach [1,16,17]: Introduced the concept of the p-value and emphasized significance as evidence against H0, without requiring an alternative.
- Neyman–Pearson approach [2,18]: Two hypotheses are assumed to represent the whole population. Testing is a decision-making process between two hypotheses, with explicit control of error rates. For more details see Table 2.

Both approaches are still used today, often in hybrid form (for conceptual blurring between p-values and significance levels see [19]).
4.6.2. Critical Comparison of Fisher and Neyman–Pearson Paradigms
Building on the formal distinctions outlined above, we now critically examine the conceptual and practical implications of these two paradigms. Statistical hypothesis testing is traditionally grounded in two distinct frameworks: Fisher’s significance testing and the Neyman–Pearson decision-theoretic paradigm. Although often combined in applied work, these approaches differ conceptually and mathematically, with important implications for modern statistical practice.
Fisher’s framework interprets the p-value as a continuous measure of evidence against the null hypothesis, defined as a tail probability under the null distribution. Its main advantage lies in its flexibility and evidential interpretation, making it useful in exploratory analyses. However, Fisher’s approach does not explicitly specify an alternative hypothesis nor control Type II error, limiting its suitability for formal decision-making and power optimization.
The Neyman–Pearson paradigm, by contrast, formulates hypothesis testing as a binary decision problem between competing hypotheses with pre-specified error rates. It provides a rigorous mathematical foundation through concepts such as rejection regions, power functions, and optimality results, including the Neyman–Pearson lemma. Nevertheless, its strict reliance on fixed significance levels and dichotomous conclusions can be restrictive in complex, data-driven settings.
In modern applications—such as high-dimensional inference, multiple testing, and machine learning—neither paradigm alone fully addresses challenges related to multiplicity, dependence, and model uncertainty. Consequently, contemporary practice increasingly adopts hybrid approaches, integrating classical testing with resampling methods, false discovery rate control, Bayesian inference, and estimation-based reporting.
The ongoing p-value controversy reflects these foundational tensions between evidential and decision-based perspectives. Alternatives such as confidence intervals, Bayes factors, and decision-theoretic criteria aim to provide richer and more transparent measures of statistical evidence. A clear understanding of the strengths and limitations of both paradigms remains essential for the appropriate application and further development of hypothesis testing methods.
5. Advanced and Alternative Methods in Hypothesis Testing
Classical hypothesis testing methods rely on specific assumptions such as normality, homogeneity, and large sample sizes. In practice, however, these assumptions are often violated. Furthermore, as data become more complex and high-dimensional, traditional methods can lose accuracy or interpretability. This section presents several advanced and alternative methods that aim to overcome the limitations of classical testing.
5.1. Bayesian Hypothesis Testing
In Bayesian hypothesis testing, we start with a prior—our initial belief about the plausibility of a hypothesis before observing any data. When new data become available, we evaluate the likelihood, which tells us how probable the observed data is under each hypothesis. Using Bayes’ theorem, we then update our belief and obtain the posterior, which reflects the revised probability of the hypothesis after incorporating the data.
Unlike the frequentist approach, which assesses the probability of data given a hypothesis, Bayesian inference evaluates the probability of a hypothesis given the data. Formally, assume that the hypotheses $H_0$ and $H_1$ have prior probabilities $P(H_0)$ and $P(H_1) = 1 - P(H_0)$. Suppose we have a random sample $\mathbf{X} = (X_1, \dots, X_n)$. We know the distributions $f(\mathbf{x} \mid H_0)$ and $f(\mathbf{x} \mid H_1)$. According to the Bayes theorem, we get the posterior probabilities of the hypotheses (see [20,21,22]):

$$P(H_i \mid \mathbf{x}) = \frac{P(H_i)\, f(\mathbf{x} \mid H_i)}{P(H_0)\, f(\mathbf{x} \mid H_0) + P(H_1)\, f(\mathbf{x} \mid H_1)}, \qquad i = 0, 1.$$

The maximum a posteriori test is based on the following idea. We do not reject $H_0$ if and only if

$$P(H_0 \mid \mathbf{x}) \ge P(H_1 \mid \mathbf{x}),$$

or equivalently

$$\frac{f(\mathbf{x} \mid H_0)}{f(\mathbf{x} \mid H_1)} \ge \frac{P(H_1)}{P(H_0)}.$$

Figure 5 shows the prior distribution for the hypotheses and the likelihood function, which indicates the support of the observed data for each hypothesis. The posterior distribution is obtained by updating the prior with the likelihood.
Beyond formal Bayesian hypothesis tests based on posterior probabilities, Bayesian inference also offers important advantages from a decision-theoretic and estimation perspective. Bayesian approaches provide a conceptually distinct framework for hypothesis testing and uncertainty quantification. Unlike frequentist methods, which rely on asymptotic approximations and long-run error rates, Bayesian inference explicitly models uncertainty about parameters through prior distributions. This allows for regularization and shrinkage in finite samples, which can stabilize estimation in settings with limited data or weak identification. For instance, consider a parametric model with parameter vector $\theta$ and sample size $n$. If the prior mean $\mu_0$ is chosen close to the true, unknown parameter value $\theta^*$, so that $\|\mu_0 - \theta^*\|$ is small, the posterior mean may achieve lower quadratic risk than the maximum likelihood estimator due to reduced variance, at the cost of a controlled bias. As $n$ grows, Bayesian and frequentist estimators converge asymptotically, but in small or moderate samples the Bayesian framework can offer more reliable uncertainty quantification.
Furthermore, credible intervals derived from the posterior are not merely Bayesian analogues of confidence intervals; they represent a mechanism for propagating parameter uncertainty in prediction and decision-making. This feature is particularly valuable in modern applications, such as high-dimensional models or machine learning pipelines, where classical asymptotic approximations may be inaccurate. By explicitly incorporating prior information, Bayesian methods can clarify identifying assumptions and provide uncertainty-aware inference, complementing frequentist hypothesis testing and supporting more robust decision-making under limited data conditions.
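As a small worked illustration of the MAP rule above (our own example, with assumed hypotheses, priors, and counts), consider two simple hypotheses about a success probability:

```python
# Posterior probabilities for two simple hypotheses about a coin,
# H0: p = 0.5 vs H1: p = 0.7, with equal priors -- a minimal sketch;
# all numbers are illustrative assumptions.
from scipy.stats import binom

k, n = 34, 50                       # observed successes out of n trials
prior0, prior1 = 0.5, 0.5
like0 = binom.pmf(k, n, 0.5)        # f(x | H0)
like1 = binom.pmf(k, n, 0.7)        # f(x | H1)

evidence = prior0 * like0 + prior1 * like1
post0 = prior0 * like0 / evidence   # P(H0 | x) by Bayes' theorem
post1 = prior1 * like1 / evidence
print(f"P(H0|x) = {post0:.3f}, P(H1|x) = {post1:.3f}")
# MAP test: do not reject H0 iff P(H0|x) >= P(H1|x)
```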
5.2. Nonparametric Tests
When data violate parametric assumptions, nonparametric tests provide a robust alternative. These methods do not rely on specific distributional assumptions and are often based on ranks rather than raw values. Nonparametric tests are especially useful in small-sample studies or when data are ordinal or heavily skewed.
5.2.1. Goodness-of-Fit Test (χ² Test)
Let $\mathbf{X} = (X_1, \dots, X_n)$ be a random sample from a population with unknown cdf $F$. The null and the alternative hypothesis for the two-tailed test are (see, for example, [23]):

$$H_0: F = F_0 \quad \text{versus} \quad H_1: F \neq F_0.$$

The test statistic is of the form

$$\chi^2 = \sum_{i=1}^{r} \frac{(N_i - n p_i)^2}{n p_i},$$

where $\mathbb{R}$ is divided into $r$ mutually disjoint subsets $S_1, \dots, S_r$. Here, $N_i$ represents the number of $X_j$ belonging to $S_i$, and $p_i = P(X \in S_i)$ when H0 is true. The rejection region is of the form

$$\{\chi^2 \ge \chi^2_{r-1,\alpha}\}.$$
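SciPy’s chisquare implements this statistic directly; a minimal sketch with assumed die-roll counts (an illustration of ours, not from the text):

```python
# Chi-square goodness-of-fit: is a die fair? The observed counts are
# illustrative assumptions; expected counts are uniform by default.
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 20, 19]          # N_i over r = 6 cells
stat, p = chisquare(observed)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")     # compare with chi2_{r-1, alpha}
```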
5.2.2. Kolmogorov–Smirnov Test (KS Test)
The Kolmogorov–Smirnov test is a nonparametric statistical test that compares empirical distribution functions. It measures the maximum distance between two cumulative distribution functions (CDFs) to determine whether a sample follows a specified distribution (one-sample KS test) or whether two samples come from the same distribution (two-sample KS test).
One-sample KS Test tests whether a sample comes from a specified continuous distribution. The null and the alternative hypothesis are (see [23]):
H0: The data come from the theoretical distribution F(x).
H1: The data do not come from F(x).
Given a sample of $n$ independent observations $X_1, \dots, X_n$, we define the empirical cumulative distribution function as:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x),$$

where $I$ is the indicator function. The KS statistic is:

$$D_n = \sup_x |F_n(x) - F(x)|.$$

Under the null hypothesis (assuming a fully specified continuous distribution), the distribution of $D_n$ is known. The rejection region is $\{D_n > d_{n,\alpha}\}$. We get the value $d_{n,\alpha}$ from KS tables. For large $n$, the test statistic can be compared to critical values from the Kolmogorov distribution:

$$P(\sqrt{n}\, D_n \le t) \longrightarrow K(t), \qquad \text{where} \quad K(t) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 t^2}.$$
A two-sample KS Test tests whether two independent samples are drawn from the same (unspecified) continuous distribution. The null and the alternative hypothesis are
H0: The two samples come from the same continuous distribution.
H1: The two samples come from different distributions.
Given two samples $X_1, \dots, X_n$ of size $n$ and $Y_1, \dots, Y_m$ of size $m$, the empirical CDFs are:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \qquad G_m(x) = \frac{1}{m} \sum_{j=1}^{m} I(Y_j \le x).$$

The two-sample KS statistic is [23]:

$$D_{n,m} = \sup_x |F_n(x) - G_m(x)|.$$

The rejection region is $\{D_{n,m} > d_\alpha\}$. For large sample sizes, the distribution of the test statistic under H0 can be approximated using the following scaling:

$$\sqrt{\frac{nm}{n+m}}\, D_{n,m}.$$
In the one-sample test, the distribution F(x) must be fully specified (i.e., parameters known). The KS test is:
- less powerful than parametric tests when their assumptions are met;
- not optimal for comparing discrete distributions unless modifications are made.
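Both variants are available in SciPy; the sketch below uses simulated samples (the distributions and sizes are assumptions chosen for illustration).

```python
# One- and two-sample Kolmogorov-Smirnov tests on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0, 1, size=200)
y = rng.normal(0.3, 1, size=150)

# One-sample: H0: the data come from N(0, 1) (fully specified)
D1, p1 = stats.kstest(x, "norm")
# Two-sample: H0: both samples come from the same continuous distribution
D2, p2 = stats.ks_2samp(x, y)
print(f"one-sample: D = {D1:.3f}, p = {p1:.3f}")
print(f"two-sample: D = {D2:.3f}, p = {p2:.3f}")
```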
5.2.3. Mann–Whitney U Test: For Independent Samples (Alternative to Two-Sample t-Test)
The Mann–Whitney U test is a nonparametric alternative to the two-sample t-test. It is used to assess whether two independent samples come from the same distribution. This test ranks all data points from both groups together and then compares the sum of ranks between the groups. It is particularly useful when the assumption of normality is violated or when the data are on an ordinal scale. Suppose we have two samples: $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$.
Combine the samples and rank all observations. Let $R_i$ be the rank of the $i$-th observation from the first sample. The U statistic is defined as ([24]):

$$U = \sum_{i=1}^{n} R_i - \frac{n(n+1)}{2}.$$

Under the null hypothesis $H_0: F = G$, the U statistic approximates a normal distribution for large samples.
5.2.4. Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is the nonparametric counterpart of the paired t-test. It is used for comparing two related or matched samples. The test is based on the ranks of the absolute differences between paired observations, considering the direction of change (positive or negative). It assumes that the differences are symmetrically distributed around the median.
This test is used for paired data $(X_i, Y_i)$ or for a single sample of paired differences (see [25]):

$$D_i = X_i - Y_i, \qquad i = 1, \dots, n.$$

Discard pairs with $D_i = 0$, then rank the absolute differences $|D_i|$. Let $R_i$ be the rank of $|D_i|$ and let $\mathrm{sign}(D_i)$ be the sign of each difference. The test statistic is:

$$W = \sum_{i=1}^{n} \mathrm{sign}(D_i)\, R_i.$$

Often the test uses the sum of positive ranks $W^+$ or the minimum of $W^+$ and $W^-$. Under H0, the distribution of $W$ is known or approximated by the normal distribution for large $n$.
5.2.5. Kruskal–Wallis Test
The Kruskal–Wallis test generalizes the Mann–Whitney U test to more than two independent groups. It serves as a nonparametric alternative to one-way ANOVA. The test evaluates whether there is a statistically significant difference between the distributions of three or more groups by analyzing the ranks of the combined data across all groups. It assumes that the observations are independent and come from distributions with the same shape.
Suppose we have $k$ groups with sizes $n_1, \dots, n_k$ and $N = \sum_{j=1}^{k} n_j$. Rank all observations from all groups together. Let $R_j$ be the sum of ranks in group $j$. The test statistic is (see [26]):

$$H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1).$$

Under H0: all samples come from identical distributions, H approximately follows a $\chi^2$ distribution with $k - 1$ degrees of freedom.
5.2.6. Friedman Test
The Friedman test is the nonparametric analog of repeated-measures ANOVA. It is used to detect differences in treatments across multiple test attempts when the same subjects are measured under different conditions. Data are ranked within each subject (row), and the test evaluates whether the rank sums differ significantly across conditions (columns).
This test is used for randomized-block designs with $k$ conditions and $n$ subjects. Let $R_j$ be the sum of ranks for each group or condition. The test statistic is (see [27]):

$$Q = \frac{12}{nk(k+1)} \sum_{j=1}^{k} R_j^2 - 3n(k+1).$$

Under H0: all treatment effects are equal, the test statistic Q approximately follows a $\chi^2$ distribution with $k - 1$ degrees of freedom (for sufficiently large $n$).
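The rank-based tests of this section are implemented in SciPy; a compact sketch on simulated data (group means and sizes are illustrative assumptions, and the Friedman call treats the simulated columns as blocks purely for demonstration):

```python
# Rank-based nonparametric tests via SciPy on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x, y, z = (rng.normal(mu, 1.0, size=25) for mu in (0.0, 0.4, 0.8))

U, p_u = stats.mannwhitneyu(x, y)          # two independent samples
W, p_w = stats.wilcoxon(x - y)             # paired differences (signed-rank)
H, p_h = stats.kruskal(x, y, z)            # k independent groups
Q, p_q = stats.friedmanchisquare(x, y, z)  # k related samples (blocks)
print(f"Mann-Whitney p={p_u:.3f}, Wilcoxon p={p_w:.3f}, "
      f"Kruskal-Wallis p={p_h:.3f}, Friedman p={p_q:.3f}")
```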
5.3. Resampling-Based Methods
Permutation tests and bootstrap methods are powerful tools that use resampling to estimate the sampling distribution of a test statistic. These methods are particularly valuable when dealing with non-standard test statistics or small sample sizes. It is important to emphasize that in permutation tests, sampling is performed without replacement, whereas in bootstrap methods, sampling is performed with replacement.
5.3.1. Permutation Tests
Permutation test generates the null distribution by randomly shuffling labels in the data. This test (also known as a randomization test or exact test) is a non-parametric method for hypothesis testing that relies on the combinatorial structure of the sample space rather than on asymptotic or distributional assumptions. Let $\mathbf{X} = (X_1, \dots, X_n)$ denote a sample from a population, and suppose we want to test a null hypothesis H0 that implies the joint distribution of $\mathbf{X}$ is invariant under a certain group of permutations $\mathcal{G} \subseteq S_n$, where $S_n$ is the group of all $n!$ permutations of $\{1, 2, \dots, n\}$. Formally, $H_0: F(x_1, \dots, x_n) = F(x_{\pi(1)}, \dots, x_{\pi(n)})$ for all $\pi \in \mathcal{G}$. Under H0, every rearrangement of the observed data induced by elements of $\mathcal{G}$ is equally likely.
Let $T(\mathbf{X})$ be a test statistic, typically chosen such that large (or small) values provide evidence against H0. The permutation distribution of $T$ under H0 is given by ([28,29]):

$$\hat{P}(T \ge t) = \frac{1}{|\mathcal{G}|} \sum_{\pi \in \mathcal{G}} I\!\left(T(\mathbf{X}^\pi) \ge t\right),$$

where $\mathbf{X}^\pi = (X_{\pi(1)}, \dots, X_{\pi(n)})$ and $I(\cdot)$ is the indicator function. The p-value for the observed statistic $T_{\mathrm{obs}} = T(\mathbf{x})$ is then computed as

$$p = \frac{1}{|\mathcal{G}|} \sum_{\pi \in \mathcal{G}} I\!\left(T(\mathbf{x}^\pi) \ge T_{\mathrm{obs}}\right).$$
This p-value represents the proportion of all possible rearrangements of the data that produce a test statistic at least as extreme as the observed one. Intuitively, it quantifies how unusual the observed result is under the null hypothesis, without relying on assumptions like normality. In practice, when $|\mathcal{G}|$ is large, a Monte Carlo approximation is used by randomly sampling a subset of permutations.
Permutation tests have the desirable property of being exact, i.e., the significance level is controlled exactly under H0, without requiring assumptions such as normality or equal variances. Furthermore, they are invariant under transformations that preserve the test statistic and are closely connected to the concept of exchangeability.
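A Monte Carlo permutation test for a two-sample mean difference can be written in a few lines; in the sketch below the group sizes and effect are assumptions, and the two-sided comparison and add-one p-value convention are common implementation choices rather than prescriptions from the text.

```python
# Monte Carlo permutation test for a difference in means between two groups.
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1, size=30)
y = rng.normal(0.6, 1, size=30)

pooled = np.concatenate([x, y])
t_obs = x.mean() - y.mean()

B = 10_000
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)            # shuffle group labels
    t_b = perm[:30].mean() - perm[30:].mean()
    if abs(t_b) >= abs(t_obs):                # two-sided comparison
        count += 1
p_value = (count + 1) / (B + 1)               # add-one correction keeps p > 0
print(f"T_obs = {t_obs:.3f}, permutation p = {p_value:.4f}")
```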
5.3.2. Bootstrap Method
Bootstrap method involves sampling with replacement to create “new” datasets. It is used to estimate standard errors, confidence intervals, and p-values.
Here we present the basic ideas of the bootstrap method. Let $\mathbf{X} = (X_1, \dots, X_n)$ be a random sample. Based on this sample, we generate a large number $B$ of bootstrap samples of size $n$, denoted by $\mathbf{X}^{*1}, \dots, \mathbf{X}^{*B}$. These samples are drawn with replacement from the original sample $\mathbf{X} = (X_1, \dots, X_n)$. The bootstrap distribution of a statistic is the distribution of the values of that statistic across the resamples.
We conclude this subsection with a bootstrap test. In the two-sample problem, we have realized samples $\mathbf{z} = (z_1, \dots, z_n)$ and $\mathbf{y} = (y_1, \dots, y_m)$ from possibly different probability distributions $F$ and $G$. The null hypothesis which is tested is $H_0: F = G$. Let $\mathbf{x}$ be the combined sample which consists of all $n + m$ observations. The test statistic can be of the form

$$t(\mathbf{x}) = \bar{z} - \bar{y}.$$

The bootstrap algorithm for calculating the p-value is of the form ([28]):
- (1) Draw $B$ samples of size $n + m$ with replacement from $\mathbf{x}$. In each bootstrap sample $\mathbf{x}^{*b}$, denote the first $n$ observations by $\mathbf{z}^{*b}$ and the remaining $m$ observations by $\mathbf{y}^{*b}$.
- (2) Evaluate $t(\cdot)$ on each bootstrap sample:

$$t(\mathbf{x}^{*b}) = \bar{z}^{*b} - \bar{y}^{*b}, \qquad b = 1, \dots, B.$$

Here, $t$ is calculated for each bootstrap resample. This generates a distribution of $t$-values under resampling, approximating how the statistic would vary if the experiment were repeated.
- (3) Approximate the p-value with

$$\hat{p} = \frac{1}{B} \sum_{b=1}^{B} I\!\left( t(\mathbf{x}^{*b}) \ge t_{\mathrm{obs}} \right),$$

where $t_{\mathrm{obs}}$ is the observed value of the statistic. The bootstrap p-value is the fraction of resampled statistics that are at least as extreme as the observed value. It provides an empirical measure of significance without assuming a specific distribution.
The theoretical justification for resampling relies on the convergence of the empirical permutation or bootstrap distribution to the true sampling distribution of the statistic. Monte Carlo approximations, used when the number of possible permutations is large, provide consistent estimates of p-values and preserve the validity of hypothesis tests under general conditions.
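The two-sample bootstrap algorithm above translates directly into code; in this sketch the data are simulated, and the choice of the mean difference as test statistic and of a two-sided comparison are illustrative assumptions.

```python
# Bootstrap two-sample test: resample the combined sample with replacement,
# split into pseudo-groups, and compare resampled statistics with t_obs.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1, size=25)
y = rng.normal(0.5, 1, size=35)
n, m = len(x), len(y)

combined = np.concatenate([x, y])
t_obs = x.mean() - y.mean()

B = 10_000
t_star = np.empty(B)
for b in range(B):
    s = rng.choice(combined, size=n + m, replace=True)
    t_star[b] = s[:n].mean() - s[n:].mean()

# Two-sided version of step (3), with the add-one convention
p_value = (np.sum(np.abs(t_star) >= abs(t_obs)) + 1) / (B + 1)
print(f"bootstrap p = {p_value:.4f}")
```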
6. Hypothesis Testing in High-Dimensional Data
In fields like finance and machine learning, researchers often test thousands of hypotheses simultaneously, leading to an increased risk of Type I errors. In high-dimensional settings, we deal with data where the number of variables $p$ is large—often larger than the number of observations. Typically, we conduct multiple simultaneous hypothesis tests, one for each variable:

$$H_{0j}: \mu_j = 0 \quad \text{versus} \quad H_{1j}: \mu_j \neq 0, \qquad j = 1, \dots, p.$$

In high-dimensional settings, testing many hypotheses inflates the risk of false positives. Two main strategies are used to control this:
- 1. Bonferroni correction
We divide the overall significance level by the number of tests:

$$\alpha_{\mathrm{adj}} = \frac{\alpha}{p}.$$

Although simple and universally valid, the Bonferroni method is highly conservative for large p, leading to low power—especially when the tests are weakly dependent or sparse signals exist.
- 2. Benjamini–Hochberg procedure
Benjamini and Hochberg [30] proposed controlling the expected proportion of false discoveries among the rejected hypotheses—the False Discovery Rate (FDR). Sort the p-values $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$. Then define:

$$k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha \right\}.$$

Reject all null hypotheses $H_{(1)}, \dots, H_{(k)}$. This procedure controls the expected proportion of false positives among rejected hypotheses.
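The step-up rule translates directly into a short function; the sketch below is a plain implementation of the definition above, with assumed p-values for illustration.

```python
# Benjamini-Hochberg step-up procedure: reject H_(1), ..., H_(k) where
# k is the largest i with p_(i) <= (i/m)*alpha.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                      # indices of sorted p-values
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    if not below.any():
        return np.zeros(m, dtype=bool)         # reject nothing
    k = np.max(np.nonzero(below)[0])           # largest i meeting the bound
    reject = np.zeros(m, dtype=bool)
    reject[order[:k + 1]] = True               # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.22, 0.49, 0.74]  # assumed p-values
print(benjamini_hochberg(pvals))
```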
6.1. High-Dimensional Mean Testing
Assume that $\mathbf{X}_{1i} \sim N_p(\mu_1, \Sigma)$ for $i = 1, \dots, n_1$ and $\mathbf{X}_{2i} \sim N_p(\mu_2, \Sigma)$ for $i = 1, \dots, n_2$. We want to test:

$$H_0: \mu_1 = \mu_2 \quad \text{versus} \quad H_1: \mu_1 \neq \mu_2.$$

The Hotelling’s $T^2$ test statistic is

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \left( \bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 \right)^\top S^{-1} \left( \bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 \right),$$

where $\bar{\mathbf{X}}_1, \bar{\mathbf{X}}_2$ are sample means and $S$ is the pooled sample covariance matrix. This test is not feasible when $p > n$, since $S$ is singular. Bai & Saranadasa [31] or Chen & Qin [32] tests use simplified statistics like:

$$T_n = \left\| \bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2 \right\|^2 - \tau,$$

where $\tau$ is a bias-correction constant. Intuitively, $T_n$ measures the squared distance between the two sample means, adjusted by a bias-correction term $\tau$ that accounts for variability in high-dimensional settings. Large values of $T_n$ suggest a significant difference between the population means.
Under H0, with suitable assumptions, the statistic $T_n$ follows an asymptotic normal distribution. Later, Chen and Qin [32] improved this framework by showing optimal power properties even when the covariance matrix exhibits complex dependence structures (see also [33]). Building on these classical results, recent theoretical advancements have refined high-dimensional mean testing by introducing methods that account for sparsity, complex dependence structures, and improved resampling-based approximations. These developments provide deeper engagement with contemporary mathematical results and highlight ongoing progress in the theory of $p \gg n$ inference.
6.2. Higher Criticism (HC) Test
Donoho and Jin [34] introduced the Higher Criticism (HC) statistic to detect sparse alternatives—settings where only a small proportion of the hypotheses are false. Sort the p-values from the individual tests: $p_{(1)} \le p_{(2)} \le \dots \le p_{(N)}$. Then define the HC statistic:

$$HC_N^* = \max_{1 \le i \le \alpha_0 N} \sqrt{N}\, \frac{i/N - p_{(i)}}{\sqrt{p_{(i)}\left(1 - p_{(i)}\right)}},$$

where $\alpha_0 \in (0, 1)$ is a tuning constant (e.g., $\alpha_0 = 1/2$).
The HC statistic detects rare signals by comparing observed p-values to their expected uniform distribution. Large values indicate a small number of hypotheses are unusually significant. The HC test magnifies small, systematic deviations between observed p-values and their expected uniform distribution under H0. It is asymptotically powerful in detecting extremely weak and rare signals—a regime where traditional tests fail. Mathematically, the HC framework connects to empirical process theory and can be interpreted as a goodness-of-fit test for uniformity of p-values.
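A direct implementation of the HC statistic (with the search restricted to the first $\alpha_0$ fraction of sorted p-values; $\alpha_0 = 0.5$ is a common choice, assumed here) might look as follows, with uniform p-values simulated under the global null.

```python
# Higher Criticism statistic computed from sorted p-values.
import numpy as np

def higher_criticism(pvals, alpha0=0.5):
    p = np.sort(np.asarray(pvals))
    N = len(p)
    i = np.arange(1, N + 1)
    hc = np.sqrt(N) * (i / N - p) / np.sqrt(p * (1 - p))
    return np.max(hc[: int(alpha0 * N)])   # search only the first alpha0*N terms

rng = np.random.default_rng(8)
p_null = rng.uniform(size=1000)            # all null hypotheses true
print(f"HC under H0: {higher_criticism(p_null):.2f}")
```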
7. Hypothesis Testing in Machine Learning
Statistical hypothesis testing forms the theoretical backbone of rigorous model evaluation and experimental design in machine learning (ML), data science, and online experimentation. While traditional hypothesis testing concerns population parameters, in ML the focus shifts toward evaluating model performance, feature relevance, and treatment effects under uncertainty.
7.1. A/B Testing and Online Experiments
A/B testing is the cornerstone of empirical validation in industry and online systems. The goal is to compare two versions of a product—control (A) and treatment (B)—to test whether a modification produces a statistically significant improvement in a key performance metric, such as conversion rate or user engagement. Let $X_A$ and $X_B$ denote performance metrics from the two groups, with sample means $\bar{X}_A$ and $\bar{X}_B$ and variances $s_A^2$ and $s_B^2$. The hypotheses are:

$$H_0: \mu_A = \mu_B \quad \text{versus} \quad H_1: \mu_A \neq \mu_B.$$

The classical two-sample t-statistic is:

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}.$$
When sample sizes are large, the test statistic approximately follows the standard normal distribution under H0. In web-scale experiments, sequential testing or Bayesian alternatives such as Thompson sampling are often used to control false discoveries while enabling adaptive experimentation.
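For conversion-rate metrics, a two-proportion z-test is a common concrete instance of this comparison; the counts in the following sketch are made-up assumptions, not data from the text.

```python
# Two-proportion z-test for conversion rates in an A/B experiment.
import numpy as np
from scipy.stats import norm

conv_A, n_A = 480, 10_000     # control: conversions / users (assumed)
conv_B, n_B = 545, 10_000     # treatment (assumed)

p_A, p_B = conv_A / n_A, conv_B / n_B
p_pool = (conv_A + conv_B) / (n_A + n_B)          # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z = (p_B - p_A) / se
p_value = 2 * norm.sf(abs(z))                     # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```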
7.2. Cross-Validation Significance Testing
Model comparison in machine learning often relies on cross-validation (CV), where data are partitioned into folds for repeated training and testing. Observed performance differences may arise from random data partitioning and training variability; therefore, careful statistical analysis is required to assess whether such differences reflect genuine model improvements.
Suppose two models $M_1$ and $M_2$ yield performance metrics across $K$ cross-validation folds, and let

$$d_k = \mathrm{score}(M_1, k) - \mathrm{score}(M_2, k), \qquad k = 1, \dots, K.$$

The null hypothesis is:

$$H_0: \mathrm{E}(d_k) = 0.$$
Since CV fold results are generally dependent, we use corrected resampled t-tests (see [35]), Bayesian methods, or permutation-based tests to obtain more reliable inference on whether observed performance differences reflect genuine model improvement rather than random variation.
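A minimal sketch of a corrected resampled t-test in the spirit of [35], using the Nadeau–Bengio variance correction; the fold-score differences and the test/train ratio below are assumed values for illustration.

```python
# Corrected resampled t-test for K dependent CV fold differences:
# the usual variance is inflated by the factor (1/K + n_test/n_train).
import numpy as np
from scipy.stats import t as t_dist

def corrected_resampled_ttest(d, test_frac):
    # d: per-fold score differences; test_frac: n_test / n_train
    K = len(d)
    d_bar, var = np.mean(d), np.var(d, ddof=1)
    se = np.sqrt((1 / K + test_frac) * var)   # corrected standard error
    t_stat = d_bar / se
    p = 2 * t_dist.sf(abs(t_stat), df=K - 1)
    return t_stat, p

d = np.array([0.021, 0.013, 0.030, 0.008, 0.017,
              0.025, 0.011, 0.019, 0.027, 0.015])     # assumed K = 10 folds
print(corrected_resampled_ttest(d, test_frac=1 / 9))  # 10-fold CV: 1/9
```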
7.3. Permutation and Bootstrap-Based Feature Significance
Permutation and bootstrap approaches offer model-agnostic methods for significance testing of feature importance. Given a fitted model $f$ and prediction function $\hat{y} = f(\mathbf{x})$, the null hypothesis for a feature $X_j$ is:

$$H_0: X_j \ \text{has no effect on the predictive performance of } f.$$

This hypothesis tests whether feature $X_j$ has any predictive power. Repeated permutation or resampling assesses how much the model’s performance changes if $X_j$ is shuffled. We repeatedly permute $X_j$ ($B$ times) while keeping other features fixed and recompute a performance metric $L_b$, $b = 1, \dots, B$. Then we calculate the permutation-based p-value as we did in (63).
Permutation tests provide a robust, distribution-free way to assess variable relevance. Such permutation-based inference provides a robust alternative to parametric tests, particularly in nonlinear models such as random forests or neural networks (see [
36]). Bootstrap resampling complements this by quantifying uncertainty in feature importance, particularly useful in ensemble and deep learning models.
From a theoretical perspective, permutation-based feature significance and bootstrap resampling provide rigorous control over type I error and quantify uncertainty in model outputs. Corrected resampled t-tests and related methods allow inference on dependent cross-validation results, ensuring that observed performance differences reflect true signal rather than random variation.
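As a concrete, hedged example of such model-agnostic inference, scikit-learn’s permutation_importance repeatedly shuffles one feature and records the drop in test-set score; the model choice and synthetic task below are illustrative assumptions.

```python
# Model-agnostic permutation importance via scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Shuffle each feature B = 30 times and record the drop in accuracy
result = permutation_importance(model, X_te, y_te,
                                n_repeats=30, random_state=0)
for j, (m, s) in enumerate(zip(result.importances_mean,
                               result.importances_std)):
    print(f"feature {j}: mean drop = {m:.3f} (sd = {s:.3f})")
```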
7.4. Multiple Testing and False Discovery Control in ML
Modern ML pipelines often involve evaluating thousands of features or hyperparameters, leading to multiple testing problems. To control the false discovery rate, procedures such as those presented in [30] and knockoff methods (see [37]) are used. These methods maintain interpretability while ensuring reliable feature selection in high-dimensional spaces.
7.5. Reproducibility and Scientific Integrity
As ML research becomes increasingly empirical, hypothesis testing ensures that improvements are statistically significant and replicable. Recent efforts emphasize combining classical inference with Bayesian and information-theoretic approaches, offering a unified framework for uncertainty quantification in AI systems.
8. Applications Across Scientific Fields
Statistical hypothesis testing has widespread applications in nearly every empirical discipline. While the core principles remain the same, each field adapts hypothesis testing to address domain-specific challenges and data types. This section illustrates how hypothesis testing is applied in practice across medicine, economics, psychology, machine learning, and industry experimentation.
8.1. Medicine and Clinical Trials
In medical research, hypothesis testing is central to the design and evaluation of randomized controlled trials.
Example: If we want to test whether a new drug is more effective than a placebo, then:
H0: There is no difference in recovery rates.
H1: The new drug improves recovery rates.
Tests typically involve:
Two-sample t-tests (for continuous outcomes)
Chi-square tests (for categorical outcomes like event rates)
Log-rank tests (for survival analysis)
Regulatory agencies often require significance levels of α = 0.05 and pre-specified hypotheses for trial approval.
8.2. Economics and Policy Evaluation
Econometrics often uses hypothesis testing to validate theoretical models and evaluate the impact of public policies.
Example: If we want to test the effect of a tax policy on employment rates, then we can:
- Use regression-based tests (e.g., t-tests on regression coefficients)
- Employ instrumental variables or difference-in-differences approaches
Hypothesis testing here is used to separate causality from correlation.
8.3. Psychology
In psychology, hypothesis development and testing is a structured method for investigating human behavior. It starts with observations that raise questions, such as whether caffeine improves alertness. Researchers formulate multiple hypotheses, generate predictions from them, and then collect data to determine which hypothesis best matches the observed results. Predictions must logically follow from the hypotheses and be testable.
In psychology we have: (1) publication bias: preference for statistically significant results and (2) low statistical power due to small sample sizes.
Statistical reform in psychology is a driving force behind more rigorous application of hypothesis testing.
8.4. Machine Learning and Data Science
Although machine learning often focuses on prediction rather than inference, hypothesis testing is increasingly used to evaluate models, compare algorithms, and understand features.
Applications: A/B testing, permutation tests, cross-validation comparisons
Hypothesis testing bridges the gap between black-box model outputs and actionable insights.
8.5. Industry and Business Analytics
In tech companies and digital marketing, hypothesis testing underpins experimental design for product development, ad optimization, and user engagement.
Example: If we want to test whether a new button color increases click-through rate, then:
H0: No difference between old and new versions.
H1: The new version is better than the old one.
Tools typically involve:
z-tests for proportions, confidence intervals, online platforms, sequential testing, and Bayesian monitoring.
9. Challenges, Misconceptions, and Modern Reform Movements
While statistical hypothesis testing remains foundational in empirical research, its widespread use has led to numerous misunderstandings, misuses, and systemic problems in scientific publishing. In recent years, these issues have sparked reform efforts across disciplines aiming to improve the reliability, transparency, and reproducibility of scientific findings.
9.1. Misinterpretation and Misuse of p-Values
The formal definition of the p-value is as follows. Let $T$ be the test statistic and let $t_{\mathrm{obs}}$ be its realized value. Then

$$p = P(T \ge t_{\mathrm{obs}} \mid H_0) \quad \text{for a right-tailed test},$$
$$p = P(T \le t_{\mathrm{obs}} \mid H_0) \quad \text{for a left-tailed test},$$
$$p = 2 \min\left\{ P(T \ge t_{\mathrm{obs}} \mid H_0),\ P(T \le t_{\mathrm{obs}} \mid H_0) \right\} \quad \text{for a two-tailed test}.$$
The p-value is frequently misunderstood:
A p-value does not indicate the probability that the null hypothesis is true.
A p-value does not measure the size or importance of an effect.
A small p-value does not guarantee that a result is practically important or that it will replicate.
Misinterpretations have led to widespread misuse and “p-hacking,” where researchers consciously or unconsciously manipulate data, models, or analysis choices to achieve statistical significance (typically p < 0.05).
9.2. P-Hacking and Publication Bias
P-hacking refers to any strategy used to obtain statistically significant results, including:
Selective reporting of variables
Optional stopping (collecting more data until significance is reached)
Multiple testing without correction
Excluding “outliers” post hoc
Publication bias further amplifies this by favoring significant results over null findings, creating a distorted scientific record.
Figure 6 illustrates p-hacking strategies, showing the frequency of p-values across repeated analyses and how selective reporting can inflate the occurrence of apparently significant p-values near 0.05.
9.3. The Replication Crisis
A large-scale effort by the Open Science Collaboration [38] attempted to replicate some studies from high-impact psychology journals. Key findings:
Replication rate was 25% for social psychology
Replication rate was 50% for cognitive psychology
The average effect size was about half the original (note that effect size measures express the practical or clinical significance of differences among sample means, as contrasted with the statistical significance of the differences)
The reproducibility rate varied by field, method, and study quality
This crisis is not unique to psychology—it extends to economics, medicine, and even machine learning research.
9.4. ASA Statement and the “Beyond p < 0.05” Movement
In 2016, the American Statistical Association (ASA) issued a historic statement (see [39]) warning against overreliance on p-values: “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”
The ASA encouraged researchers to:
Focus on effect sizes, confidence intervals, and context
Avoid binary “significant/non-significant” thinking
Embrace transparency, pre-registration, and open science
This gave rise to the “Beyond p < 0.05” movement (see [40]), which advocates for a broader and more nuanced inferential framework. Key recommendations include:
Estimation-based inference
Bayesian methods
Decision-theoretic approaches
Registered reports and reproducible workflows
9.5. Recommended Reforms and Best Practices
To address the well-documented limitations of null hypothesis significance testing and the overreliance on p-values, several methodological reforms have been proposed and increasingly adopted across the social, behavioral, and biomedical sciences. These reforms are presented in Table 3 and aim to improve transparency, reproducibility, and the interpretability of statistical evidence (see [41]).
10. Conclusions and Future Directions
Statistical hypothesis testing remains one of the central methods in inferential statistics and in many empirical sciences. From the early ideas of Fisher, through the work of Neyman and Pearson, to modern approaches that include Bayesian statistics and methods for high-dimensional data, the concept of hypothesis testing has gradually developed into a framework that allows researchers to quantify uncertainty and make data-based conclusions. The classical approach—relying on null and alternative hypotheses, p-values, and error control—still forms the basis for interpreting results in fields such as medicine, economics, psychology, and machine learning.
However, the increasing complexity of data and the growth of large-scale studies have exposed limitations of traditional methods. Overreliance on fixed significance thresholds, widespread misuse of p-values, and challenges in reproducibility have prompted a methodological and philosophical re-examination of hypothesis testing. Modern research emphasizes effect sizes, confidence intervals, Bayesian evidence, and the integration of prior information, promoting a shift from binary decision-making toward a more nuanced understanding of statistical evidence.
Future research in hypothesis testing is expected to proceed along several key directions.
High-dimensional and complex data inference: The rise of big data and high-dimensional statistics calls for new test statistics that remain powerful and computationally efficient in regimes where traditional asymptotics fail. Techniques such as random matrix theory and resampling-based inference will continue to expand.
Integration with machine learning: As models become increasingly nonlinear and complex, there is a need for frameworks that allow hypothesis testing in “black-box” settings. Challenges include testing feature importance, model interpretability, and validation of deep learning results.
Reproducibility and transparency: The replication crisis across multiple disciplines has strengthened the call for more transparent workflows. Pre-registration of research plans, open data, and clearly defined analysis procedures are becoming increasingly standard practice.
Bayesian and decision-theoretic frameworks: Bayesian hypothesis testing offers a coherent probabilistic interpretation of evidence, enabling sequential analysis, adaptive experimentation, and model comparison. Expanding these approaches for large-scale and hierarchical models represents an important avenue for development.
Ethical and interpretative dimensions: As hypothesis testing continues to influence policy, medicine, and AI systems, future discussions must address the epistemological and ethical implications of statistical inference—ensuring that statistical significance aligns with practical significance and social impact.
In conclusion, while statistical hypothesis testing has undergone a century of refinement, its evolution is far from complete. The field stands at the intersection of classical rigor and computational innovation, striving for methods that are not only mathematically sound but also transparent, reproducible, and contextually meaningful. The future of hypothesis testing lies in unifying theoretical advances with responsible data practice—building a statistically literate and trustworthy foundation for the sciences of tomorrow.