1. Introduction
Hardy–Weinberg Equilibrium (HWE) plays an important role in population genetics and related scientific domains. HWE is a common assumption in many areas of research, so assessing the compatibility of observed genotype frequencies with HWE is a basic step of a complete statistical analysis. There are two main approaches to this task: goodness-of-fit tests and equivalence tests.
A vast literature exists on goodness-of-fit tests for HWE, including applications of the asymptotic $\chi^2$ and likelihood ratio tests. Specific exact goodness-of-fit tests for HWE are developed in [1,2,3,4,5], among others. The null hypothesis of all these tests is that the underlying population is exactly in HWE. Hence, goodness-of-fit tests are tailored to establish a lack of compatibility with HWE.
Equivalence tests are appropriate for establishing sufficiently good agreement of the observed genotype frequencies with HWE. Exact and approximate equivalence tests for the biallelic case were recently developed in [6,7,8]. To the best of our knowledge, no equivalence tests exist for HWE with multiple alleles. This paper develops two different equivalence tests for the case of multiple alleles. The tests can be carried out using the asymptotic approximation or the bootstrap method.
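A distribution of diploid genotypes at a k-allele locus can be represented as a lower triangular matrix $p = (p_{ij})_{1 \le j \le i \le k}$, where $p_{ij}$ is the probability of the genotype with alleles i and j. Let $\theta(p) = (\theta_1(p), \ldots, \theta_k(p))$ denote the allele distribution under p. The probability of allele i under p can be calculated as $\theta_i(p) = p_{ii} + \frac{1}{2} \sum_{j < i} p_{ij} + \frac{1}{2} \sum_{j > i} p_{ji}$. If the population is in HWE, then the genotype distribution fulfills the conditions $p_{ij} = 2 \theta_i \theta_j$ for $i > j$ and $p_{ii} = \theta_i^2$, where $\theta = \theta(p)$. Let $e(\theta)$ denote the genotype distribution under the assumption of HWE, which is implied by the allele distribution $\theta$.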
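As an illustration of these definitions, the following minimal Python sketch computes the allele distribution and the HWE genotype distribution it implies. The nested-list layout (row i holds the entries for j ≤ i) and the function names `allele_distribution` and `hwe_distribution` are assumptions made for this example, not notation from the paper.

```python
def allele_distribution(p):
    """Allele probabilities from a lower triangular genotype matrix.

    p[i][j] (j <= i) is the probability of the genotype with alleles i and j.
    A homozygote contributes fully to its allele; a heterozygote contributes
    one half to each of its two alleles.
    """
    k = len(p)
    theta = [0.0] * k
    for i in range(k):
        for j in range(i + 1):
            if i == j:
                theta[i] += p[i][j]
            else:
                theta[i] += 0.5 * p[i][j]
                theta[j] += 0.5 * p[i][j]
    return theta


def hwe_distribution(theta):
    """Genotype distribution under HWE implied by the allele distribution."""
    k = len(theta)
    return [
        [theta[i] ** 2 if i == j else 2.0 * theta[i] * theta[j]
         for j in range(i + 1)]
        for i in range(k)
    ]


# Illustrative biallelic example (made-up numbers, not data from the paper).
p = [[0.30],
     [0.40, 0.30]]
theta = allele_distribution(p)   # both alleles have probability 0.5
e = hwe_distribution(theta)      # HWE genotypes: 0.25, 0.5, 0.25
```

The lower triangular layout stores each unordered genotype once, so the entries of a valid distribution sum to one.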
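The Euclidean distance $d_c(p) = \lVert p - e(\theta(p)) \rVert$ can be considered a conditional distance between the genotype distributions p and $e(\theta(p))$ under the joint allele distribution $\theta(p)$. The equivalence test problem is then defined by

$$H_0 = \{ d_c(p) \ge \varepsilon \} \quad \text{against} \quad H_1 = \{ d_c(p) < \varepsilon \}, \tag{1}$$

where $\varepsilon > 0$ is a tolerance parameter.
Let $\mathcal{M}$ denote the family of all possible genotype distributions at HWE. The minimum distance between p and $\mathcal{M}$ is defined by $d_m(p) = \min_{q \in \mathcal{M}} \lVert p - q \rVert$. The corresponding equivalence test problem is given by

$$H_0 = \{ d_m(p) \ge \varepsilon \} \quad \text{against} \quad H_1 = \{ d_m(p) < \varepsilon \}. \tag{2}$$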
We observe the relative genotype frequencies $\hat{p}_n$ in a sample of size n. The natural test statistic for (1) is $d_c(\hat{p}_n)$, which can be easily computed. The appropriate test statistic for (2) is $d_m(\hat{p}_n)$, which requires numerical optimization to evaluate the minimum. The test statistic $d_c(\hat{p}_n)$ can be considered a numerically efficient approximation to $d_m(\hat{p}_n)$ because $d_m(p) \le d_c(p)$ for every p. The subscript * will be used instead of c and m in the remainder of the paper when statements apply to both cases.
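The conditional distance statistic can be sketched in a few lines of Python. This is a minimal illustration under two assumptions: the Euclidean distance is taken over the k(k+1)/2 entries of the lower triangular matrix, and the input is a nested list of genotype counts (row i holds the counts for j ≤ i). The function name `conditional_distance` is hypothetical. The minimum distance statistic is not shown because it additionally needs a numerical optimizer over the HWE family.

```python
from math import sqrt


def conditional_distance(counts):
    """Euclidean distance between the observed relative genotype frequencies
    and the HWE distribution implied by the observed allele frequencies."""
    k = len(counts)
    n = sum(sum(row) for row in counts)
    p = [[c / n for c in row] for row in counts]

    # Allele frequencies: homozygotes count fully, heterozygotes one half each.
    theta = [0.0] * k
    for i in range(k):
        for j in range(i + 1):
            if i == j:
                theta[i] += p[i][j]
            else:
                theta[i] += 0.5 * p[i][j]
                theta[j] += 0.5 * p[i][j]

    # Sum of squared deviations from the implied HWE genotype probabilities.
    total = 0.0
    for i in range(k):
        for j in range(i + 1):
            expected = theta[i] ** 2 if i == j else 2.0 * theta[i] * theta[j]
            total += (p[i][j] - expected) ** 2
    return sqrt(total)


# Illustrative counts (made up): deviations 0.05, -0.10, 0.05 from HWE.
stat = conditional_distance([[30], [40, 30]])
```

No optimization is involved here, which is why this statistic is the cheaper of the two to evaluate.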
If Hypothesis (1) or (2) of non-equivalence can be rejected for some appropriate value of $\varepsilon$, then the true underlying genotype distribution is close to HWE with probability greater than $1 - \alpha$, where $\alpha$ is the nominal level of the test. The appropriate value of $\varepsilon$ depends on the application and the available sample size. The value of the parameter $\varepsilon$ can be found by simulation as shown in Section 3.2. Alternatively, the minimum tolerance parameter $\varepsilon_{\min}$, for which $H_0$ can be rejected, can be computed and reported; see Section 2 for details.
2. Equivalence Tests
In this section, we derive the asymptotic distributions of the test statistics $d_c(\hat{p}_n)$ and $d_m(\hat{p}_n)$. We also provide an algorithm for the asymptotic and bootstrap-based tests.
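Let v be the usual bijective mapping of the matrix p to the vector $v(p)$. Let $\dot{d}_c(q)$ denote the derivative of the function $q \mapsto d_c(v^{-1}(q))$, where q is a vector of length $k(k+1)/2$. The derivative $\dot{d}_c$ can be derived using the chain rule. Let $p_0$ fulfill the boundary condition $d_c(p_0) = \varepsilon$ and let $q_0 = v(p_0)$. Then the asymptotic distribution of $\sqrt{n}\,(d_c(\hat{p}_n) - \varepsilon)$ under $p_0$ is Gaussian with mean zero and variance $\sigma_c^2(p_0) = \dot{d}_c(q_0)\, \Sigma_{q_0}\, \dot{d}_c(q_0)^{T}$, where $\Sigma_{q} = D_{q} - q q^{T}$ is the covariance matrix and $D_{q}$ is a square diagonal matrix whose diagonal entries are the elements of q. The proof of the statement can be found in [9].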
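The test statistic $d_m(\hat{p}_n)$ converges weakly under the assumption that there exists a continuous function h on an open neighborhood of $p_0$ such that $h(p) \in \mathcal{M}$ and $\lVert p - h(p) \rVert = d_m(p)$. The existence of a continuous minimizer h is also an important requirement for the numerical computation of $d_m(\hat{p}_n)$. We assume the existence of a continuous minimizer h on an open neighborhood of $p_0$ for the remainder of the paper. Let $\dot{d}_m(q)$ denote the derivative of the function $q \mapsto d_m(v^{-1}(q))$. Then the asymptotic distribution of $\sqrt{n}\,(d_m(\hat{p}_n) - \varepsilon)$ under $p_0$ is Gaussian with mean zero and variance $\sigma_m^2(p_0) = \dot{d}_m(q_0)\, \Sigma_{q_0}\, \dot{d}_m(q_0)^{T}$; see [10] for details.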
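The asymptotic variance $\sigma_*^2(p_0)$ is unknown and can be estimated by $\hat{\sigma}_*^2 = \sigma_*^2(\hat{p}_n)$. The asymptotic test can be carried out as follows:
- (1) Given are the genotype frequencies $\hat{p}_n$, the tolerance parameter $\varepsilon$ and the significance level $\alpha$.
- (2) Compute the test statistic $d_*(\hat{p}_n)$.
- (3) Estimate the asymptotic variance by $\hat{\sigma}_*^2 = \sigma_*^2(\hat{p}_n)$.
- (4) Reject $H_0$ if $\sqrt{n}\,(d_*(\hat{p}_n) - \varepsilon) / \hat{\sigma}_* < c_\alpha$, where $c_\alpha$ is the lower $\alpha$-quantile of the standard normal distribution.
The minimum tolerance parameter $\varepsilon_{\min}$, for which the asymptotic test can reject $H_0$, can be computed as $\varepsilon_{\min} = d_c(\hat{p}_n) - c_\alpha \hat{\sigma}_c / \sqrt{n}$ or $\varepsilon_{\min} = d_m(\hat{p}_n) - c_\alpha \hat{\sigma}_m / \sqrt{n}$, correspondingly.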
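The four steps and the minimum tolerance parameter can be sketched for the conditional distance statistic as follows. This is a minimal illustration, not the author's implementation. It assumes the Euclidean distance over the lower triangular entries, a plug-in variance of the form g (D_q − q qᵀ) gᵀ, and a central finite-difference gradient g in place of the analytic chain-rule derivative. The names `d_c_flat`, `numerical_gradient` and `asymptotic_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def d_c_flat(q, k):
    """Conditional distance for a genotype vector q, stored row by row
    in lower triangular order: (1,1), (2,1), (2,2), (3,1), ..."""
    theta = [0.0] * k
    idx = 0
    for i in range(k):
        for j in range(i + 1):
            if i == j:
                theta[i] += q[idx]
            else:
                theta[i] += 0.5 * q[idx]
                theta[j] += 0.5 * q[idx]
            idx += 1
    total = 0.0
    idx = 0
    for i in range(k):
        for j in range(i + 1):
            e = theta[i] ** 2 if i == j else 2.0 * theta[i] * theta[j]
            total += (q[idx] - e) ** 2
            idx += 1
    return sqrt(total)


def numerical_gradient(f, q, h=1e-6):
    """Central finite differences (an assumption replacing the chain rule)."""
    grad = []
    for m in range(len(q)):
        up, down = list(q), list(q)
        up[m] += h
        down[m] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def asymptotic_test(counts, eps, alpha=0.05):
    """Return (statistic, sigma_hat, reject_H0, eps_min)."""
    k = len(counts)
    flat = [c for row in counts for c in row]
    n = sum(flat)
    q = [c / n for c in flat]

    def f(x):
        return d_c_flat(x, k)

    stat = f(q)                                     # step (2)
    g = numerical_gradient(f, q)
    # step (3): g (D_q - q q^T) g^T = sum q_i g_i^2 - (sum q_i g_i)^2
    mean_g = sum(qi * gi for qi, gi in zip(q, g))
    var = sum(qi * gi * gi for qi, gi in zip(q, g)) - mean_g ** 2
    sigma = sqrt(max(var, 0.0))
    c_alpha = NormalDist().inv_cdf(alpha)           # lower alpha-quantile
    reject = sqrt(n) * (stat - eps) < c_alpha * sigma   # step (4)
    eps_min = stat - c_alpha * sigma / sqrt(n)
    return stat, sigma, reject, eps_min


# Illustrative counts (made up, not data from the paper).
stat, sigma, reject, eps_min = asymptotic_test([[30], [40, 30]], eps=0.3)
```

Because the quantile $c_\alpha$ is negative for small $\alpha$, the minimum tolerance parameter always exceeds the observed distance. The gradient is not defined when the observed frequencies are exactly in HWE, so this sketch should not be relied on at that point.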
To improve the finite sample performance of the proposed tests, the bootstrap method is applied to estimate the variance of the test statistic $d_*(\hat{p}_n)$; see [11], Section 6 for details. The asymptotic variance estimator is then replaced by the bootstrap estimator of the variance. Otherwise, everything stays the same.
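A bootstrap variance estimate for the conditional distance statistic might look like the sketch below. It is written under stated assumptions rather than taken from the paper: resampling is multinomial from the observed relative frequencies, the number of replicates `n_boot` and the seed are arbitrary choices, and the sample variance of the replicated statistics is multiplied by n so that the result is on the same scale as the asymptotic standard deviation. The names `d_c_flat` and `bootstrap_sigma` are hypothetical.

```python
import random
from math import sqrt
from statistics import variance


def d_c_flat(q, k):
    """Conditional distance for a genotype vector q, stored row by row
    in lower triangular order: (1,1), (2,1), (2,2), (3,1), ..."""
    theta = [0.0] * k
    idx = 0
    for i in range(k):
        for j in range(i + 1):
            if i == j:
                theta[i] += q[idx]
            else:
                theta[i] += 0.5 * q[idx]
                theta[j] += 0.5 * q[idx]
            idx += 1
    total = 0.0
    idx = 0
    for i in range(k):
        for j in range(i + 1):
            e = theta[i] ** 2 if i == j else 2.0 * theta[i] * theta[j]
            total += (q[idx] - e) ** 2
            idx += 1
    return sqrt(total)


def bootstrap_sigma(counts, n_boot=500, seed=0):
    """Bootstrap estimate of the standard deviation of sqrt(n) * d_c."""
    k = len(counts)
    flat = [c for row in counts for c in row]
    n = sum(flat)
    q = [c / n for c in flat]
    cells = list(range(len(q)))
    rng = random.Random(seed)

    replicates = []
    for _ in range(n_boot):
        # Draw n genotypes from the observed frequencies and recount them.
        resampled = [0] * len(q)
        for cell in rng.choices(cells, weights=q, k=n):
            resampled[cell] += 1
        replicates.append(d_c_flat([c / n for c in resampled], k))

    # Scale by n to match the normalization of the asymptotic variance.
    return sqrt(n * variance(replicates))


# Illustrative counts (made up, not data from the paper).
sigma_boot = bootstrap_sigma([[300], [400, 300]])
```

The returned value would take the place of the plug-in estimate in step (3); the rejection rule in step (4) is unchanged.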
4. Summary
Two different test statistics are proposed to establish equivalence of the genotype distributions to HWE. The critical values of the tests are calculated using the asymptotic approximation by the normal distribution. The variance of the test statistic is estimated asymptotically or by the bootstrap method. The minimum tolerance parameter $\varepsilon_{\min}$, for which $H_0$ can be rejected, is derived. The tests are successfully applied to three real data sets that are frequently considered in the literature. The test power at HWE and the type I error rates are studied at a large number of points inspired by the real data sets. The asymptotic tests have anti-conservative tendencies and should be used with caution. The bootstrap-based tests are sufficiently conservative for most practical situations. If more conservative tests are required, then the nominal level may be halved or the tolerance parameter $\varepsilon$ may be reduced. We recommend performing all proposed tests in any case and comparing the results. The appropriate value of $\varepsilon$ depends on the application and the available sample size. Reasonable values of the parameter $\varepsilon$ can be found by simulation as shown in Section 3.2. Additionally, the rejection probabilities at close random boundary points may be studied as shown in Section 3.3.