1. Introduction
Testing independence has been widely applied in the association analysis of two-way contingency tables from cross-sectional studies and other statistical applications. Ref. [1] investigated the association between levels of paternal education (completed university, partially completed university, completed secondary education, and not completed secondary education) and quartiles of neonatal weight gain ($Q_1$: lowest; $Q_2$: second lowest; $Q_3$: second highest; and $Q_4$: highest) from a cross-sectional study involving 13,262 Belarusian infants born at or over 37 weeks of gestation and weighing at or over 2500 g.
Table 1 provides the observed frequencies and the expected frequencies under independence (in parentheses).
With nine degrees of freedom, the observed test statistic for the independence between levels of paternal education and quartiles of neonatal weight gain yields a small p-value; at the usual level of significance, the data reject the independence of the levels of paternal education and the quartiles of neonatal weight gain. The nature of a dependence is usually revealed by the distribution of the differences between the observed and expected frequencies, i.e., the residuals of independence. For example, the number of subjects with the highest neonatal weight gain and partial university paternal education exceeds the expected number, while the number of subjects with the highest neonatal weight gain and secondary paternal education falls below the expected value.
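To make the test concrete, the following sketch computes the expected frequencies under independence and the Pearson chi-square statistic with $(r-1)(c-1)$ degrees of freedom. The 2×2 counts used here are hypothetical, for illustration only; they are not the Belarusian data of Table 1.

```python
# Pearson chi-square test of independence for a two-way table.
# The 2x2 counts below are hypothetical, for illustration only.

def expected_frequencies(table):
    """Expected cell counts under independence: n_i. * n_.j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson chi-square statistic and its degrees of freedom."""
    expected = expected_frequencies(table)
    stat = sum((obs - exp) ** 2 / exp
               for obs_row, exp_row in zip(table, expected)
               for obs, exp in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

table = [[20, 30],
         [30, 20]]
stat, df = pearson_chi2(table)  # X^2 = 4.0 with df = 1 for this table
```

The p-value would then be read from the chi-square distribution with `df` degrees of freedom.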
Since the analytical form of the distribution of the residuals of independence is not available, the first four moments of the distribution provide vital information about the center, spread, skewness, and kurtosis of the distribution. For example, standardized residuals of independence are commonly used to reduce the heterogeneity from cell to cell. Standardization is usually carried out with the asymptotic mean and variance of the residuals. Non-asymptotic and explicit expressions of the mean and variance of the residuals of independence seem to be missing in the literature. This paper explicitly derives the first four raw moments of the residuals of independence under a multinomial model.
2. Main Results
Consider the following $r \times c$ contingency table:
| | 1 | 2 | ⋯ | j | ⋯ | c | Total |
| 1 | $n_{11}$ | $n_{12}$ | ⋯ | $n_{1j}$ | ⋯ | $n_{1c}$ | $n_{1\cdot}$ |
| 2 | $n_{21}$ | $n_{22}$ | ⋯ | $n_{2j}$ | ⋯ | $n_{2c}$ | $n_{2\cdot}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ | | ⋮ | ⋮ |
| i | $n_{i1}$ | $n_{i2}$ | ⋯ | $n_{ij}$ | ⋯ | $n_{ic}$ | $n_{i\cdot}$ |
| ⋮ | ⋮ | ⋮ | | ⋮ | | ⋮ | ⋮ |
| r | $n_{r1}$ | $n_{r2}$ | ⋯ | $n_{rj}$ | ⋯ | $n_{rc}$ | $n_{r\cdot}$ |
| Total | $n_{\cdot 1}$ | $n_{\cdot 2}$ | ⋯ | $n_{\cdot j}$ | ⋯ | $n_{\cdot c}$ | $n$ |
The residual of independence of cell $(i, j)$ is defined as
$$\hat{R}_{ij} = n_{ij} - \frac{n_{i\cdot}\, n_{\cdot j}}{n}.$$
Assume that $(n_{11}, n_{12}, \dots, n_{rc})$ follows a multinomial distribution with $n$ trials and a probability of $p_{ij}$ for cell $(i, j)$, for $i = 1, \dots, r$ and $j = 1, \dots, c$, where $\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1$, i.e.,
$$(n_{11}, n_{12}, \dots, n_{rc}) \sim \mathrm{Multinomial}(n;\, p_{11}, p_{12}, \dots, p_{rc}).$$
The following factorial moments of a multinomial distribution are taken from [2] and can be proven straightforwardly.
Lemma 1. Assume that $(X_1, \dots, X_k) \sim \mathrm{Multinomial}(n;\, p_1, \dots, p_k)$. For any nonnegative integer m and random variable $X$, let $X^{(m)} = X(X - 1) \cdots (X - m + 1)$, with $X^{(0)} = 1$. Then, for nonnegative integers $m_1, \dots, m_k$, we have
$$E\left[\prod_{t=1}^{k} X_t^{(m_t)}\right] = n^{(m_1 + \cdots + m_k)} \prod_{t=1}^{k} p_t^{m_t}.$$
In particular, $E(X_t) = n p_t$, $E[X_t^{(2)}] = n(n - 1) p_t^2$, and $E(X_s X_t) = n(n - 1) p_s p_t$ for $s \neq t$.
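As a quick numerical sanity check (ours, not part of the original derivation), the factorial-moment identity of Lemma 1 can be verified by exact enumeration of a small trinomial distribution; the parameter values below are arbitrary:

```python
import itertools
import math

def falling(x, m):
    """Falling factorial x^(m) = x(x-1)...(x-m+1), with x^(0) = 1."""
    out = 1
    for t in range(m):
        out *= x - t
    return out

def multinomial_pmf(counts, n, probs):
    """Multinomial probability mass function."""
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

def factorial_moment(n, probs, orders):
    """E[prod_t X_t^(m_t)] by exhaustive enumeration of all outcomes."""
    total = 0.0
    for counts in itertools.product(range(n + 1), repeat=len(probs)):
        if sum(counts) == n:
            pmf = multinomial_pmf(counts, n, probs)
            total += pmf * math.prod(falling(c, m) for c, m in zip(counts, orders))
    return total

n, probs, orders = 6, (0.2, 0.3, 0.5), (2, 1, 0)
lhs = factorial_moment(n, probs, orders)               # enumerated value
rhs = falling(n, sum(orders)) * math.prod(p ** m for p, m in zip(probs, orders))
```

Here `lhs` and `rhs` agree up to floating-point rounding, in line with the lemma.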
The next result is taken from [3] and can be proven directly from the definition of the multinomial distribution.
Lemma 2. Assume that $(X_1, \dots, X_k) \sim \mathrm{Multinomial}(n;\, p_1, \dots, p_k)$ and $\{A_1, \dots, A_m\}$ is a set partition of $\{1, \dots, k\}$. Let $Y_j = \sum_{t \in A_j} X_t$ and $q_j = \sum_{t \in A_j} p_t$ for $j = 1, \dots, m$. Then,
$$(Y_1, \dots, Y_m) \sim \mathrm{Multinomial}(n;\, q_1, \dots, q_m).$$
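Lemma 2 (collapsing cells of a multinomial yields a multinomial over the merged cells) can likewise be illustrated numerically. The sketch below, with arbitrarily chosen parameters of our own, merges the first two cells of a trinomial and compares the resulting distribution of the merged count with the binomial probability mass function that the lemma predicts:

```python
import itertools
import math

def multinomial_pmf(counts, n, probs):
    """Multinomial probability mass function."""
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = float(coef)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

n, probs = 5, (0.2, 0.3, 0.5)

# Distribution of Y1 = X1 + X2, obtained by marginalizing the trinomial.
merged = [0.0] * (n + 1)
for counts in itertools.product(range(n + 1), repeat=3):
    if sum(counts) == n:
        merged[counts[0] + counts[1]] += multinomial_pmf(counts, n, probs)

# Lemma 2 predicts Y1 ~ Binomial(n, q1) with q1 = p1 + p2.
q1 = probs[0] + probs[1]
predicted = [multinomial_pmf((y, n - y), n, (q1, 1 - q1)) for y in range(n + 1)]
```

The two probability vectors `merged` and `predicted` coincide up to rounding.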
The mean and variance of the residuals of independence are given below.
Theorem 1. Assume that $(n_{11}, \dots, n_{rc}) \sim \mathrm{Multinomial}(n;\, p_{11}, \dots, p_{rc})$, where the number of trials $n$ is a constant, $p_{ij} > 0$ for $i = 1, \dots, r$ and $j = 1, \dots, c$, and $\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1$. For any $i$ and $j$, consider the residual of independence of cell $(i, j)$, $\hat{R}_{ij} = n_{ij} - n_{i\cdot} n_{\cdot j} / n$, where $n_{i\cdot} = \sum_{j=1}^{c} n_{ij}$ and $n_{\cdot j} = \sum_{i=1}^{r} n_{ij}$. We have, for $i = 1, \dots, r$ and $j = 1, \dots, c$,
$$E(\hat{R}_{ij}) = (n - 1)(p_{ij} - p_{i\cdot}\, p_{\cdot j})$$
and
$$\mathrm{Var}(\hat{R}_{ij}) = E(\hat{R}_{ij}^2) - (n - 1)^2 (p_{ij} - p_{i\cdot}\, p_{\cdot j})^2,$$
where
$$E(\hat{R}_{ij}^2) = n^{(2)} p_{ij}^2 + n p_{ij} - \frac{2}{n}\left[n^{(3)} p_{ij}\, p_{i\cdot}\, p_{\cdot j} + n^{(2)} p_{ij}(p_{ij} + p_{i\cdot} + p_{\cdot j}) + n p_{ij}\right] + \frac{1}{n^2}\left[n^{(4)} p_{i\cdot}^2\, p_{\cdot j}^2 + n^{(3)} p_{i\cdot}\, p_{\cdot j}(p_{i\cdot} + p_{\cdot j} + 4 p_{ij}) + n^{(2)}\left(2 p_{ij}(p_{ij} + p_{i\cdot} + p_{\cdot j}) + p_{i\cdot}\, p_{\cdot j}\right) + n p_{ij}\right],$$
and where $n^{(m)} = n(n - 1) \cdots (n - m + 1)$, $p_{i\cdot} = \sum_{j=1}^{c} p_{ij}$, and $p_{\cdot j} = \sum_{i=1}^{r} p_{ij}$. When independence holds, i.e., $p_{ij} = p_{i\cdot}\, p_{\cdot j}$,
$$E(\hat{R}_{ij}) = 0 \quad \text{and} \quad \mathrm{Var}(\hat{R}_{ij}) = (n - 1)\, p_{i\cdot}\, p_{\cdot j}\, (1 - p_{i\cdot})(1 - p_{\cdot j}).$$
Proof. For any $i = 1, \dots, r$ and $j = 1, \dots, c$, note that, by Lemma 1, $E(n_{ij}^2) = n^{(2)} p_{ij}^2 + n p_{ij}$ and $E(n_{ik} n_{lj}) = n^{(2)} p_{ik}\, p_{lj}$ whenever $(i, k) \neq (l, j)$, so that
$$E(n_{i\cdot}\, n_{\cdot j}) = \sum_{k=1}^{c} \sum_{l=1}^{r} E(n_{ik}\, n_{lj}) = n^{(2)} p_{i\cdot}\, p_{\cdot j} + n p_{ij}.$$
Recall that $E(n_{ij}) = n p_{ij}$, so we have
$$E(\hat{R}_{ij}) = n p_{ij} - \frac{1}{n}\left[n^{(2)} p_{i\cdot}\, p_{\cdot j} + n p_{ij}\right] = (n - 1)(p_{ij} - p_{i\cdot}\, p_{\cdot j}).$$
To calculate the variance of $\hat{R}_{ij}$, let
$$U = n_{i\cdot} - n_{ij} \quad \text{and} \quad V = n_{\cdot j} - n_{ij}.$$
Then, by Lemma 2,
$$(n_{ij}, U, V, n - n_{ij} - U - V) \sim \mathrm{Multinomial}(n;\, p_{ij},\, p_{i\cdot} - p_{ij},\, p_{\cdot j} - p_{ij},\, 1 - p_{i\cdot} - p_{\cdot j} + p_{ij}).$$
For any $i$ and $j$, expanding $n_{i\cdot} = n_{ij} + U$ and $n_{\cdot j} = n_{ij} + V$ and applying the factorial moments of Lemma 1 yields
$$E(n_{ij}\, n_{i\cdot}\, n_{\cdot j}) = n^{(3)} p_{ij}\, p_{i\cdot}\, p_{\cdot j} + n^{(2)} p_{ij}(p_{ij} + p_{i\cdot} + p_{\cdot j}) + n p_{ij}$$
and
$$E(n_{i\cdot}^2\, n_{\cdot j}^2) = n^{(4)} p_{i\cdot}^2\, p_{\cdot j}^2 + n^{(3)} p_{i\cdot}\, p_{\cdot j}(p_{i\cdot} + p_{\cdot j} + 4 p_{ij}) + n^{(2)}\left(2 p_{ij}(p_{ij} + p_{i\cdot} + p_{\cdot j}) + p_{i\cdot}\, p_{\cdot j}\right) + n p_{ij}.$$
The second moment can be obtained by noting
$$E(\hat{R}_{ij}^2) = E(n_{ij}^2) - \frac{2}{n}\, E(n_{ij}\, n_{i\cdot}\, n_{\cdot j}) + \frac{1}{n^2}\, E(n_{i\cdot}^2\, n_{\cdot j}^2).$$
When independence holds, i.e., $p_{ij} = p_{i\cdot}\, p_{\cdot j}$, it is straightforward to see that $E(\hat{R}_{ij}) = 0$ and $\mathrm{Var}(\hat{R}_{ij}) = (n - 1)\, p_{i\cdot}\, p_{\cdot j}\, (1 - p_{i\cdot})(1 - p_{\cdot j})$. □
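The mean and variance of the residual can be checked numerically. The following sketch is our own verification, not part of the paper: it enumerates all outcomes of a small 2×2 multinomial table and compares the exact mean of $\hat{R}_{11}$ with $(n-1)(p_{11} - p_{1\cdot} p_{\cdot 1})$ and, under independence, the exact variance with $(n-1)\, p_{1\cdot}\, p_{\cdot 1} (1 - p_{1\cdot})(1 - p_{\cdot 1})$.

```python
import itertools
import math

def residual_moments(n, p):
    """Exact mean and variance of R_11 = n_11 - n_1. * n_.1 / n by enumeration.

    p is a 2x2 list of cell probabilities [[p11, p12], [p21, p22]].
    """
    probs = [p[0][0], p[0][1], p[1][0], p[1][1]]
    m1 = m2 = 0.0
    for counts in itertools.product(range(n + 1), repeat=4):
        if sum(counts) != n:
            continue
        coef = math.factorial(n)
        for c in counts:
            coef //= math.factorial(c)
        pmf = coef * math.prod(q ** c for q, c in zip(probs, counts))
        n11, n12, n21, n22 = counts
        res = n11 - (n11 + n12) * (n11 + n21) / n
        m1 += pmf * res
        m2 += pmf * res ** 2
    return m1, m2 - m1 ** 2

# An independent table: p_ij = p_i. * p_.j with p_1. = 0.3 and p_.1 = 0.6.
pr, pc, n = 0.3, 0.6, 5
p = [[pr * pc, pr * (1 - pc)], [(1 - pr) * pc, (1 - pr) * (1 - pc)]]
mean, var = residual_moments(n, p)
# Under independence: mean = 0 and var = (n - 1) * pr * pc * (1 - pr) * (1 - pc).
```

Exhaustive enumeration is exponential in the number of cells, so this is only feasible for tiny tables; its purpose here is verification, not computation.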
Based on the exact variance of the residuals of independence above, the standardized residual of independence of cell $(i, j)$ is
$$\frac{n_{ij} - n_{i\cdot}\, n_{\cdot j}/n}{\sqrt{(n - 1)\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}\, (1 - \hat{p}_{i\cdot})(1 - \hat{p}_{\cdot j})}},$$
where $\hat{p}_{i\cdot} = n_{i\cdot}/n$ and $\hat{p}_{\cdot j} = n_{\cdot j}/n$. This exact standardized residual is asymptotically equivalent to
$$\frac{n_{ij} - n_{i\cdot}\, n_{\cdot j}/n}{\sqrt{n\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}\, (1 - \hat{p}_{i\cdot})(1 - \hat{p}_{\cdot j})}},$$
which is used in many textbooks, e.g., [4].
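Numerically, the two standardizations differ only by the constant factor $\sqrt{n/(n-1)}$, as the sketch below illustrates on a small hypothetical table (values ours):

```python
import math

def standardized_residuals(table, exact=True):
    """Residuals of independence divided by their estimated standard deviation.

    exact=True uses the (n - 1) factor from the exact variance;
    exact=False uses the large-sample n factor found in textbooks.
    """
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    factor = (n - 1) if exact else n
    out = []
    for i, r in enumerate(table):
        out.append([])
        for j, nij in enumerate(r):
            resid = nij - row[i] * col[j] / n
            pr, pc = row[i] / n, col[j] / n
            sd = math.sqrt(factor * pr * pc * (1 - pr) * (1 - pc))
            out[-1].append(resid / sd)
    return out

table = [[20, 30], [30, 20]]
exact = standardized_residuals(table, exact=True)
approx = standardized_residuals(table, exact=False)
ratio = exact[0][0] / approx[0][0]  # equals sqrt(n / (n - 1)) for every cell
```

For large $n$ the ratio tends to 1, which is the asymptotic equivalence noted above.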
In order to derive the third and fourth moments of we need higher-order mixed moments. However, the derivation of higher-order mixed moments from higher-order factorial moments in Lemma 1 is too tedious. Using the differential relationships between the moment-generating function of a distribution and its moments as well as the computer algebra system Wolfram|Alpha, we obtain the following Lemma 3.
Lemma 3. Assume that $(X_1, \dots, X_k) \sim \mathrm{Multinomial}(n;\, p_1, \dots, p_k)$. For any nonnegative integers $m_1, \dots, m_k$, let $\mu_{m_1, \dots, m_k} = E\left[\prod_{t=1}^{k} X_t^{m_t}\right]$. Then,
$$\mu_{m_1, \dots, m_k} = \left.\frac{\partial^{\,m_1 + \cdots + m_k}}{\partial t_1^{m_1} \cdots \partial t_k^{m_k}} \left(p_1 e^{t_1} + \cdots + p_k e^{t_k}\right)^n \right|_{t_1 = \cdots = t_k = 0}.$$
Proof. Since $(X_1, \dots, X_k) \sim \mathrm{Multinomial}(n;\, p_1, \dots, p_k)$, its moment-generating function is
$$M(t_1, \dots, t_k) = E\left[e^{t_1 X_1 + \cdots + t_k X_k}\right] = \left(p_1 e^{t_1} + \cdots + p_k e^{t_k}\right)^n.$$
The results are obtained by noting that
$$E\left[\prod_{t=1}^{k} X_t^{m_t}\right] = \left.\frac{\partial^{\,m_1 + \cdots + m_k} M(t_1, \dots, t_k)}{\partial t_1^{m_1} \cdots \partial t_k^{m_k}}\right|_{t_1 = \cdots = t_k = 0}$$
for nonnegative integers $m_1, \dots, m_k$. □
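For instance (our illustration, not from the paper), the mixed raw moment $E(X_1^2 X_2^2)$ that Lemma 3 produces can be cross-checked against the factorial-moment expansion $E(X_1^2 X_2^2) = n^{(4)} p_1^2 p_2^2 + n^{(3)}(p_1^2 p_2 + p_1 p_2^2) + n^{(2)} p_1 p_2$, obtained from Lemma 1 via $X^2 = X^{(2)} + X$, and against exact enumeration:

```python
import itertools
import math

def falling(x, m):
    """Falling factorial x^(m) = x(x-1)...(x-m+1), with x^(0) = 1."""
    out = 1
    for t in range(m):
        out *= x - t
    return out

def raw_moment(n, probs, orders):
    """E[prod_t X_t^{m_t}] for a multinomial, by exhaustive enumeration."""
    total = 0.0
    for counts in itertools.product(range(n + 1), repeat=len(probs)):
        if sum(counts) != n:
            continue
        coef = math.factorial(n)
        for c in counts:
            coef //= math.factorial(c)
        pmf = coef * math.prod(p ** c for p, c in zip(probs, counts))
        total += pmf * math.prod(c ** m for c, m in zip(counts, orders))
    return total

n, p1, p2 = 6, 0.2, 0.3
probs = (p1, p2, 1 - p1 - p2)
enumerated = raw_moment(n, probs, (2, 2, 0))
closed_form = (falling(n, 4) * p1**2 * p2**2
               + falling(n, 3) * (p1**2 * p2 + p1 * p2**2)
               + falling(n, 2) * p1 * p2)
```

Converting factorial moments to raw moments by hand in this way quickly becomes tedious as the orders grow, which is precisely why the moment-generating-function route of Lemma 3 is used for the third and fourth moments.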
Theorem 2 next provides the explicit expressions of the third and fourth moments of the residuals of independence.
Theorem 2. Assume that $(n_{11}, \dots, n_{rc}) \sim \mathrm{Multinomial}(n;\, p_{11}, \dots, p_{rc})$, where the number of trials $n$ is a constant, $p_{ij} > 0$ for $i = 1, \dots, r$ and $j = 1, \dots, c$, and $\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1$. For any $i$ and $j$, consider the residual of independence of cell $(i, j)$, $\hat{R}_{ij} = n_{ij} - n_{i\cdot}\, n_{\cdot j}/n$, where $n_{i\cdot} = \sum_{j=1}^{c} n_{ij}$ and $n_{\cdot j} = \sum_{i=1}^{r} n_{ij}$. We have, for $i = 1, \dots, r$ and $j = 1, \dots, c$,
$$E(\hat{R}_{ij}^3) = E(n_{ij}^3) - \frac{3}{n}\, E(n_{ij}^2\, n_{i\cdot}\, n_{\cdot j}) + \frac{3}{n^2}\, E(n_{ij}\, n_{i\cdot}^2\, n_{\cdot j}^2) - \frac{1}{n^3}\, E(n_{i\cdot}^3\, n_{\cdot j}^3)$$
and
$$E(\hat{R}_{ij}^4) = E(n_{ij}^4) - \frac{4}{n}\, E(n_{ij}^3\, n_{i\cdot}\, n_{\cdot j}) + \frac{6}{n^2}\, E(n_{ij}^2\, n_{i\cdot}^2\, n_{\cdot j}^2) - \frac{4}{n^3}\, E(n_{ij}\, n_{i\cdot}^3\, n_{\cdot j}^3) + \frac{1}{n^4}\, E(n_{i\cdot}^4\, n_{\cdot j}^4),$$
where each mixed moment on the right-hand side is an explicit polynomial in $p_{ij}$, $p_{i\cdot}$, $p_{\cdot j}$, and $n^{(m)}$ obtained from Lemma 3, with $n^{(m)} = n(n - 1) \cdots (n - m + 1)$ for any nonnegative integer $m$. Proof. Let
$$U = n_{i\cdot} - n_{ij} \quad \text{and} \quad V = n_{\cdot j} - n_{ij}.$$
Then, by Lemma 2,
$$(n_{ij}, U, V, n - n_{ij} - U - V) \sim \mathrm{Multinomial}(n;\, p_{ij},\, p_{i\cdot} - p_{ij},\, p_{\cdot j} - p_{ij},\, 1 - p_{i\cdot} - p_{\cdot j} + p_{ij}).$$
Since $n_{i\cdot} = n_{ij} + U$ and $n_{\cdot j} = n_{ij} + V$, we have, from Lemma 3, the mixed moments $E(n_{ij}^s\, n_{i\cdot}^t\, n_{\cdot j}^u)$ for nonnegative integers $s$, $t$, and $u$. The result is obtained by noting
$$\hat{R}_{ij}^3 = \left(n_{ij} - \frac{n_{i\cdot}\, n_{\cdot j}}{n}\right)^3 = n_{ij}^3 - \frac{3}{n}\, n_{ij}^2\, n_{i\cdot}\, n_{\cdot j} + \frac{3}{n^2}\, n_{ij}\, n_{i\cdot}^2\, n_{\cdot j}^2 - \frac{1}{n^3}\, n_{i\cdot}^3\, n_{\cdot j}^3.$$
The fourth moment is obtained from Lemma 3 and the following:
$$\hat{R}_{ij}^4 = n_{ij}^4 - \frac{4}{n}\, n_{ij}^3\, n_{i\cdot}\, n_{\cdot j} + \frac{6}{n^2}\, n_{ij}^2\, n_{i\cdot}^2\, n_{\cdot j}^2 - \frac{4}{n^3}\, n_{ij}\, n_{i\cdot}^3\, n_{\cdot j}^3 + \frac{1}{n^4}\, n_{i\cdot}^4\, n_{\cdot j}^4.$$
□
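As a numerical check (ours, with arbitrarily chosen probabilities), the raw moments $E(\hat{R}_{11}^m)$, $m \le 4$, can be computed by exhaustive enumeration for a small dependent table; assembling them into central moments then reproduces the directly enumerated central moments:

```python
import itertools
import math

def residual_raw_and_central(n, probs):
    """Raw moments E[R^m], m = 1..4, and central moments of R_11 by enumeration.

    probs = (p11, p12, p21, p22) for a 2x2 multinomial table.
    """
    raw = [0.0] * 5
    outcomes = []
    for counts in itertools.product(range(n + 1), repeat=4):
        if sum(counts) != n:
            continue
        coef = math.factorial(n)
        for c in counts:
            coef //= math.factorial(c)
        pmf = coef * math.prod(p ** c for p, c in zip(probs, counts))
        n11, n12, n21, n22 = counts
        res = n11 - (n11 + n12) * (n11 + n21) / n
        outcomes.append((pmf, res))
        for m in range(1, 5):
            raw[m] += pmf * res ** m
    mu = raw[1]
    central3 = sum(pmf * (res - mu) ** 3 for pmf, res in outcomes)
    central4 = sum(pmf * (res - mu) ** 4 for pmf, res in outcomes)
    return raw, central3, central4

n = 4
probs = (0.1, 0.2, 0.3, 0.4)  # a dependent table, chosen arbitrarily
raw, central3, central4 = residual_raw_and_central(n, probs)
mu, m2, m3, m4 = raw[1], raw[2], raw[3], raw[4]
var = m2 - mu ** 2
assembled3 = m3 - 3 * mu * m2 + 2 * mu ** 3                      # third central moment
assembled4 = m4 - 4 * mu * m3 + 6 * mu ** 2 * m2 - 3 * mu ** 4   # fourth central moment
kappa4 = assembled4 - 3 * var ** 2                               # fourth cumulant
```

The enumerated mean also matches $(n-1)(p_{11} - p_{1\cdot} p_{\cdot 1})$ from Theorem 1.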
The exact third and fourth central moments can be derived straightforwardly by noting
$$E\{[\hat{R}_{ij} - E(\hat{R}_{ij})]^3\} = E(\hat{R}_{ij}^3) - 3\, E(\hat{R}_{ij})\, E(\hat{R}_{ij}^2) + 2\, [E(\hat{R}_{ij})]^3$$
and
$$E\{[\hat{R}_{ij} - E(\hat{R}_{ij})]^4\} = E(\hat{R}_{ij}^4) - 4\, E(\hat{R}_{ij})\, E(\hat{R}_{ij}^3) + 6\, [E(\hat{R}_{ij})]^2\, E(\hat{R}_{ij}^2) - 3\, [E(\hat{R}_{ij})]^4.$$
Note that the first four cumulants of a distribution are its mean, variance, third central moment, and fourth central moment minus three times the squared variance. We can also obtain the exact first four cumulants of the distribution of the residuals of independence. Corollary 1 gives explicit expressions for the third and fourth central moments as well as the fourth cumulant.
Corollary 1. Under the conditions of Theorem 2, the third central moment of $\hat{R}_{ij}$ is
$$\mu_3 = E(\hat{R}_{ij}^3) - 3\, E(\hat{R}_{ij})\, E(\hat{R}_{ij}^2) + 2\, [E(\hat{R}_{ij})]^3.$$
The fourth central moment of $\hat{R}_{ij}$ is
$$\mu_4 = E(\hat{R}_{ij}^4) - 4\, E(\hat{R}_{ij})\, E(\hat{R}_{ij}^3) + 6\, [E(\hat{R}_{ij})]^2\, E(\hat{R}_{ij}^2) - 3\, [E(\hat{R}_{ij})]^4.$$
The fourth cumulant of $\hat{R}_{ij}$ is
$$\kappa_4 = \mu_4 - 3\, [\mathrm{Var}(\hat{R}_{ij})]^2,$$
where the raw moments $E(\hat{R}_{ij}^m)$, $m = 1, \dots, 4$, are given in Theorems 1 and 2 and $n^{(m)} = n(n - 1) \cdots (n - m + 1)$ for any nonnegative integer $m$.

3. Conclusions
We have explicitly derived the first four moments of the residuals of independence in a two-way contingency table under a multinomial model. From these exact moments, we have the exact skewness, $\mu_3 / \sigma^3$, and kurtosis, $\mu_4 / \sigma^4$ (where $\mu_3$ and $\mu_4$ are the third and fourth central moments and $\sigma^2$ is the variance), of the distribution of the residuals of independence. These explicit but tedious results provide us with the vital statistical characteristics of the exact distribution of the residuals of independence in the association analysis of two-way contingency tables. Moreover, since the joint distribution of independent Poisson random variables, conditional on their sum, is a multinomial distribution, these exact results can also be used in the residual analysis of log-linear models. Higher-order raw moments of the residuals of independence can be found similarly, but the results are more complicated.
Currently, most residual diagnostics of discrete data depend on large-sample methods. When sample sizes are not large or the data are sparse, diagnostic results based on large-sample theory are debatable, and exact methods or methods based on non-asymptotic theory are desirable. The explicit moments of the residuals of independence contribute significantly to exact residual diagnostics. More discussions of and references to the exact analysis of discrete data are given in [5].