Contingency Table Analysis and Inference via Double Index Measures

In this work, we focus on a general family of measures of divergence for estimation and testing with emphasis on conditional independence in cross tabulations. For this purpose, a restricted minimum divergence estimator is used for the estimation of parameters under constraints and a new double index (dual) divergence test statistic is introduced and thoroughly examined. The associated asymptotic theory is provided and the advantages and practical implications are explored via simulation studies.


Introduction
The concept of distance or divergence is known since at least the time of Pearson, who, in 1900, considered the classical goodness-of-fit (gof) problem by considering the distance between observed and expected frequencies. The problem for both discrete and discretized continuous distributions have been in the center of attention for the last 100+ years. The classical set-up is the one considered by Pearson where a hypothesized m-dimensional multinomial distribution, say Multi(N, p 1 , . . . , p m ) is examined as being the underlying distributional mechanism for producing a given sample of size N. The problem can be extended to examine the homogeneity (in terms of the distributional mechanisms) among two independent samples or the independence among two population characteristics. In all such problems we are dealing with cross tabulations or crosstabs (or contingency tables). Problems of such nature appear frequently in a great variety of fields including biosciences, socio-economic and political sciences, actuarial science, finance, business, accounting, and marketing. The need to establish for instance, whether the mechanisms producing two phenomena are the same or not is vital for altering economic policies, preventing socio-economic crises or enforcing the same economic or financial decisions to groups with similar underlying mechanisms (e.g., retaining the insurance premium in case of similarity or having different premiums in case of diversity). It is important to note that divergence measures play a pivotal role also in statistical inference in continuous settings. Indeed, for example, in [1] the authors investigate the multivariate normal case while in a recent work [2], the modified skew-normal-Cauchy (MSNC) distribution is considered, against normality.
Divergence measures can be used for estimating purposes by minimizing the associated measure. The classical estimating technique is the one where (1) we take α = 0 and Φ(x) = Φ KL (x). Then, the resulting KL minimization is equivalent to the classical maximization of the likelihood producing the well-known Maximum Likelihood Estimator (MLE, see ( [9], Section 5.2)). In general, the minimization with respect to the parameter of interest of the divergence measure, gives rise to the corresponding minimum divergence estimator (see, e.g., [6,10,11]). For the case where constraints are involved the case associated with Csiszar's family of measures was recently investigated [12]. For further references, please refer to [13][14][15][16][17][18][19][20][21].
Consider the hypothesis where p is the vector of the true but unknown probabilities of the underlying distribution and p(θ 0 ) the vector of the corresponding probabilities of the hypothesized distribution which is unknown and falls within the family of P with the unknown parameters satisfying in general, certain constraints, e.g., of the form c(θ) = 0, under which the estimation of the parameter will be performed. The purpose of this work is twofold: having as a reference the divergence measure given in (1), we will first propose a general double index divergence class of measures and make inference regarding the parameter estimators involved. Then, we proceed with the hypothesis problem with the emphasis given to the concept of conditional independence. The innovative idea proposed in this work is the duality in choosing among the members of the general class of divergences, one for estimating and one for testing purposes which may not be necessarily, the same. In that sense, we propose a double index divergence test statistic offering the greatest possible range of options, both for the strictly convex function Φ and the indicator value α > 0. Thus, the estimation problem can be examined considering expression (1) using a function Φ 2 ∈ F * and an indicator α 2 > 0: the minimization of which with respect to the unknown parameter, will produce the restricted minimum (Φ 2 , α 2 ) divergence (rMD) estimator for some constraints c(θ) = 0. Observe that the unknown vector of underlying probabilities has been replaced by the vector of the corresponding sample frequenciesp. Then, the testing problem will be based on where Φ 1 (·) and α 1 may be different from the corresponding quantities used for the estimation problem in (4). Finally, the duality of the proposed methodology surfaces when the testing problem is explored via the dual divergence test statistic formulated on the basis of the double-α-double-Φ divergence given by where Φ 1 , Φ 2 ∈ F * and α 1 , α 2 > 0. The remaining parts of this work are: Section 2 presents the formal definition and the asymptotic properties of the rMD estimator (rMDE). Section 3 deals with the general testing problem with the use of rMDE. The associated set up for the case of three-way contingency tables is developed in Section 4 with a simulation section emphasizing on the conditional independence of three random variables. We close this work with some conclusions.

Restricted Minimum (Φ, α)-Power Divergence Estimator
In what follows, we will provide the formal definition and the expansion of the rMD estimator and prove its asymptotic normality. The assumptions required for establishing the results of this section for the rMD estimator under constraints, are provided below: . . , f ν (θ) are the constrained functions on the s-dimensional parameter θ, f k (θ) = 0, k = 1, . . . , ν and ν < s < m − 1; (A 1 ) There exists a value θ 0 ∈ Θ, such that X = (X 1 , . . . , X m ) ∼ Multi(N, p(θ 0 )); (A 2 ) Each constraint function f k (θ) has continuous second partial derivatives; (A 3 ) The ν × s and m × s matrices In order to derive the decomposition ofθ r (Φ,α) the Implicit Function Theorem (IFT) is exploited according to which if a function has an invertible derivative at a point then itself is invertible in a neighbourhood of this point but it cannot be expressed in closed form [23].
The theorem below establishes the asymptotic normality of rMDE which is a straightforward extension of Theorem 2.4 [11] since by the Central Limit Theorem we know that with the asymptotic variance-covariance matrix Σ p(θ 0 ) given by diag(p(θ 0 )) − p(θ 0 )p(θ 0 ) .

Theorem 2.
Under Assumptions (A 0 )-(A 5 ), by (11) and for W(θ 0 ) given in (10), the asymptotic distribution of rMDE is the s-dimensional Normal distribution given by

Remark 1.
The proposed class of estimators forms a family of estimators that goes beyond the indicator α since it is easy to see that estimators obtained for the Csiszar's ϕ family are given for α = 0 in (1) and also the standard equiprobable model.

Statistical Inference
In this section, we introduce the double index divergence test statistic with Φ 1 , Φ 2 ∈ F * and α 1 , α 2 > 0 and make the additional assumptions by which we focus on the Csiszar's family of measures for testing purposes (the notation ϕ is used for clarity) and the equiprobable model: The Theorem below provides the asymptotic distribution of (12) under Assumptions (A 0 )-(A 7 ). Assumption (A 7 ) will be later relaxed and a general asymptotic result will be presented in the next subsection. A discussion about Assumption A 6 will also be made in the sequel.
Proof. It is straightforward that which by Theorem 2, expression (11), and for M(θ 0 ) = J(θ 0 )W(θ 0 ) reduces to Combining the above we obtain The expansion of d ϕ (p, q) around (p(θ 0 ), p(θ 0 )) yields Then, under A 7 , T(θ 0 ) (see (14)) is a projection matrix of rank m − 1 − s + ν since the trace of the matrices respectively. Then, the result follows from the fact (see ( [24], p. 57)) that X X has a chi-squared distribution with degrees of freedom equal to the rank of the variance-covariance matrix of the random vector X as long as it is a projection matrix.

Remark 2.
Relaxation of Assumption (A 6 ): Arguing as in [11], when the true model is not the equiprobable the result of Theorem 3 holds true as long as α 2 = 0 and approximately true when α 2 → 0.

Asymptotic Theory of the Dual Divergence Test Statistic
Having established the two main results of the work, namely the decomposition of the proposed restricted estimator (Theorem 1) together with its asymptotic properties (Theorem 2), as well as the asymptotic distribution of the associated test statistic under the class of Csiszar ϕ-functions (Theorem 3) we continue below extended in a natural way the results of [11] for the dual divergence test statistic. The extensions presented in this section are considered vital due to their practical impication on cross tabulations discussed in Section 4. The proofs will be omitted since both results (Theorems 4 and 5) follow along the lines of previous results (see Theorems 3.4 and 3.9 of [11]). In what follows we adopt the following notation:

Remark 3.
Consider the case where Assumption (A 6 ) is relaxed. Then, the asymptotic distribution of the test statistic T as long as α 2 = 0 or α 2 → 0. For further elaboration of this remark we refer to [11].

Remark 4.
Observe that if α 1 → 0 then b → 1 and the asymptotic distribution becomes χ 2 k , while for α 1 away from 0 the distribution is proportional to χ 2 k with proportionality index b = 1. However, for not equiprobable models these statements hold true as long as α 2 is close to zero.
Consider now the hypothesis with contiguous alternatives [25,26] where d is an m-dimensional vector of known real values with components d i satisfying the assumption ∑ m i=1 d i = 0. Observe that as N tends to infinity, the local contiguous alternative converges to the null hypothesis at the rate O(N 1/2 ). Alternatives, such as those in (16), are known as Pitman transition alternatives or Pitman (local) alternatives or local contiguous alternatives to the null hypothesis H 0 [25].

Remark 5.
Observe that under Assumption (A 6 ) (p i = 1/m) the asymptotic distribution is independent of Φ, α 1 and α 2 . As a result the associated power of the test is Pr(

Cross Tabulations and Dual Divergence Test Statistic
In this section, we try to take advantage of the methodology proposed earlier for the analysis of cross tabulations. In particular we focus on the case of three categorical variables, say X, Y, and Z with corresponding, I, J, and K. Then, assume that the probability mass of a realization of a randomly selected subject is denoted by p ijk (θ) = Pr(X = i, Y = j, Z = k) > 0, where here and in what follows i = 1, . . . , I, j = 1, . . . , J, k = 1, . . . , K unless otherwise stated. The associated probability vector is given as In this set up the dual divergence test statistics is given as wherep ijk as above and the rMD estimator aŝ For α 1 , α 2 = 0 and special cases of the functions Φ 1 and Φ 2 , classical restricted minimum divergence estimators and associated test statistics can be derived from (18) and (17), respectively. For example, for α 1 , α 2 = 0, and Φ 1 , Φ 2 = Φ KL the likelihood ratio test statistic with the restricted maximum likelihood estimator (G 2 (θ r )) can be derived, while for Φ 1 , Φ 2 = Φ λ and λ = 1 we obtain the chi-squared test statistic with the restricted minimum chi-squared estimator (X 2 (θ r X 2 )). For Φ 1 , Φ 2 = Φ λ and λ = 2/3 the dual divergence test statistic reduces to the power divergence test statistic with the restricted minimum power divergence estimator (CR(θ r CR )) whereas for λ = −1/2 reduces to the Freeman-Tukey test statistic with the restricted minimum Freeman-Tukey estimator (FT(θ r FT )).

Remark 6.
For practical purposes, the choice of the values of the indices is motivated by the work of [8] where, in an attempt to achieve a compromise between robustness and efficiency of estimators, they recommended the use of small values in the (0, 1) region. In the following subsection, our analysis will reconfirm their findings since as it will be seen, values of both indices close to (0) (than to one (1)) will be found to be associated with a good performance not only in terms of estimation but also in terms of goodness of fit as it will be reflected in the size and the power of the test.

Simulation Study
In this simulation study, we use the rMD estimator and the associated dual divergence test statistic for the analysis of cross tabulations. Specifically, we are going to compare in terms of size and power classical tests with those that can be derived through the proposed methodology, for the problem of conditional independence of three random variables in contingency tables. We test the hypothesis of conditional independence for a 2 × 2 × 2 contingency table, thus in this case we have m = 8 probabilities of the multinomial model, s = 7 unknown parameters to estimate and two constraint functions (ν = 2) which are given by f 221 (θ) = θ 111 θ 221 − θ 121 θ 211 and f 222 (θ) = θ 112 For a better understanding of the behaviour of the dual divergence test statistic given in (17) we compare it with the four classical tests-of-fit mentioned earlier in Section 4, is applied for Φ 1 = Φ α 1 , Φ 2 = Φ α 2 and six different values of α 1 and α 2 , α 1 , α 2 = 10 −7 , 0.01, 0.05, 0.10, 0.50, and 1.50. Note that, the critical values used in this simulation study, are the asymptotic critical values based on the asymptotic distribution bχ 2 2 with b as in (15) for the double index family of test statistics, and the χ 2 2 for the classical test statistics. For the analysis we used 100,000 simulations and sample sizes equal to n = 20, 25 (small sample sizes) and n = 40, 45 (moderate sample sizes).
For w = 0 we take the model under the null hypothesis of conditional independence while for values w = 0 we take the models under the alternative hypotheses. We considered the following values of w = 0.00, 0.30, 0.60, and 0.90. Note that the larger the value of w the more we deviate from the null model. For the simulation study, we used the R software [28], while for the constrained optimization the auglag function from the nloptr package [29].
From Table 1, we can observe that in terms of size the performance of the T ) is adequate for values of α 1 , α 2 ≤ 0.5 both for small and moderate sample sizes. In addition, we can see that for α 1 ≤ 0.10, T α 1 Φ 1 (θ r (Φ 2 ,α 2 ) ) appears to be liberal while for α 1 ≥ 0.5 appears to be conservative. We also note that the size becomes smaller as α 1 and α 2 increase with α 1 ≥ α 2 . Table 2 provides the size of the classical tests-of-fit from where we can observe that CR(θ r CR ) has the best performance among all competing tests for every sample size. In contrast, FT(θ r FT ) has the worst performance among all competing tests and appears to be very liberal. Furthermore, X 2 (θ r X 2 ) appears to be conservative while G 2 (θ r ) appears to be liberal. Note that for α 1 ∈ [0.01, 0.5] and α 2 ≤ 0.10, T α 1 Φ 1 (θ r (Φ 2 ,α 2 ) ) behaves better than the G 2 (θ r ) test statistic and its performance is quite close to the performance of the X 2 (θ r X 2 ). In order to examine the closeness of the estimated (true) size to the nominal size α = 0.05 we consider the criterion given by Dale [30]. The criterion involves the following inequality where logit(p) = log(p/(1 − p)) andα n is the estimated (true) size. The estimated (true) size is considered to be close to the nominal size if (19) is satisfied with d = 0.35. Note that in this situation the estimated (true) size is close to the nominal one ifα n ∈ [0.0357, 0.0695] and is presented in Tables 1 and 2 in bold. This criterion has been used previously among others by [27,31].
Regarding the proposed test we can see that for small sample sizes the estimated (true) size is close to the nominal for α 1 ∈ [0.10, 0.50] and α 2 ≤ 0.10 while for moderate sample sizes for α 1 ∈ [10 −7 , 0.50] and α 2 ≤ 0.10. With reference to the classical tests-of-fit we can observe that the size of the CR(θ r CR ) is close to the nominal for every sample size whereas the size of G 2 (θ r ) and X 2 (θ r X 2 ) is close only for moderate sample sizes. Finally, we note that the estimated (true) size of FT(θ r FT ) fails to be close to the nominal both for small and moderate sample sizes.
In Tables 3-5, we provide the results regarding the power of the proposed family of test statistics for the three alternatives and sample sizes n = 20, 25, 40, 45, while Table 2 provides the results regarding the power of the classical tests-of-fit. The performance tends to be better as we deviate from the null model and as the sample size increases both for the classical and the proposed tests.   As general comments regarding the behaviour of the proposed and the classical testsof-fit in terms of power we state that the best results for the T α 1 Φ 1 (θ r (Φ 2 ,α 2 ) ) are obtained for small values of α 1 in the range (0, 0.1] and large values of α 2 with α 1 ≤ α 2 . Note that although in terms of power results become better as α 2 increases in terms of size these are adequate only for α 2 ≤ 0.5. In addition, we can observe that the performance of T α 1 Φ 1 (θ r (Φ 2 ,α 2 ) ) is better than the CR(θ r CR ) and X 2 (θ r X 2 ) for every alternative and every sample size for α 1 ≤ 0.1 and α 2 ≤ 0.5 and slightly better than G 2 (θ r ) for small values of α 1 and large values of α 2 , for example for α 1 = 0.01 and α 2 = 0.50. Furthermore, we can observe that for α 1 = 0.1 and α 2 ≤ 0.1 the size of the test is better than the size of the G 2 (θ r ) and slightly worst form the size of the CR(θ r CR ) and X 2 (θ r X 2 ) test statistics while its power is quite better than the power of the CR(θ r CR ) and X 2 (θ r X 2 ) and slightly worst than the G 2 (θ r ). Additionally, we can see that as α 1 and α 2 tend to 0 the behaviour of the 2 ) ) test statistic coincides with the G 2 (θ r ) test both in terms of size and power as it was expected.
In order to attain a better insight about the behaviour of the test statistics, we apply Dale's criterion, not only for the nominal size α = 0.05, but also for a range of nominal sizes that are of interest. Based on the previous analysis, beside the classical tests, we will focus our interest on the T 0.05 Φ 1 (θ r (Φ 2 ,0.05) ), T 0.10 Φ 1 (θ r (Φ 2 ,0.10) ), and T 0.20 Φ 1 (θ r (Φ 2 ,0.20) ). The following simplified notation is used in every Figure, 20) ). From Figure 1a, we can see that for small sample sizes (n = 25) T 0.20 Φ 1 (θ r (Φ 2 ,0.20) ) and CR(θ r CR ) satisfy Dale's criterion for every nominal size while T 0.10 Φ 1 (θ r (Φ 2 ,0.10) ) and X 2 (θ r X 2 ) for nominal sizes greater than 0.03 and 0.06, respectively. Note that the dashed line in Figure 1 denotes the situation in which the estimated (true) size equals to the nominal size and thus lines that lie above this reference line refer to liberal tests while those that lie below to conservative ones. On the other hand, for moderate sample sizes (n = 45) all chosen test statistics satisfy Dale's criterion except FT(θ r FT ). Taking into account the fact that the actual size of each test differs from the targeted nominal size, we have to make an adjustment in order to proceed further with the comparison of the tests in terms of power. We focus our interest in those tests that satisfy Dale's criterion and follow the method proposed in [32] which involves the so-called receiver operating characteristic (ROC) curves. In particular, let G(t) = Pr(T ≥ t) be the survivor function of a general test statistic T, and c = inf{t : G(t) ≤ α} be the critical value, then ROC curves can be formulated by plotting the power G 1 (c) against the size G 0 (c) for various values of the critical value c. Note that with G 0 (t) we denote the distribution of the test statistic under the null hypothesis and with G 1 (t) under the alternative. Since results are similar for every alternative we restrict ourselves to w = 0.60 which refers to an alternative that is neither too close nor too far from the null. For small sample sizes (n = 25) results are presented in Figure 2, where we can see that the proposed test is superior from the classical tests-of-fit in terms of power. However, for moderate sample sizes (n = 45) we can observe in Figure 3 that G 2 (θ r ) has the best performance among all competing tests followed by the proposed test-of-fit. From the conducted analysis we conclude that regarding the proposed test there is a trade off between size and power for different choices of the indices α 1 and α 2 . In particular, we can see that as α 1 increases the size becomes smaller in the expense of smaller power, while as α 2 increases the power becomes better and the tests more liberal. In conclusion, we could state that for values of α 1 and α 2 in the range (0.05, 0.25) the resulting test statistic provides a fair balance between size and power which makes it an attractive alternative to the classical tests-of-fit where for small sample sizes larger values of the indices are preferable whereas for moderate sample sizes, smaller ones are recommended.

Conclusions
In this work, a general divergence family of test statistics is presented for hypothesis testing problems as in (3), under constraints. For estimating purposes, we introduce, discuss and use the rMD (restricted minimum divergence) estimator presented in (8). The proposed double index (dual) divergence test statistic involves two pairs of elements, namely (Φ 2 , α 2 ) to be used for the estimation problem and (Φ 1 , α 1 ) to be used for the testing problem. The duality refers to the fact that the two pairs may or may not be the same providing the researcher with the greatest possible flexibility.
The asymptotic distribution of the dual divergence test statistic is found to be proportional to the chi-squared distribution irrespectively of the nature of the multinomial model, as long as the values of the two indicators involved are relative close to zero (less than 0.5). Such values are known to provide a satisfactory balance between efficiency and robustness (see, for instance, [8] or [3]).
The methodology developed in this work can be used in the analysis of contingency tables which is applicable in various scientific fields: biosciences, such as genetics [33] and epidemiology [34]; finance, such as the evaluation of investment effectiveness or business performance [35]; insurance science [36]; or socioeconomics [37]. This work concludes with a comparative simulation study between classical test statistics and members of the proposed family, where the focus is placed on the conditional independence of three random variables. Results indicate that, by selecting wisely the values of the α 1 and α 2 indices, we can derive a test statistic that can be thought of as a powerful and reliable alternative to the classical tests-of-fit especially for small sample sizes.

Acknowledgments:
The authors wish to express their appreciation to the anonymous referees and the Associated Editor for their valuable comments and suggestions. The authors wish also to express their appreciation to the professor A. Batsidis of the University of Ioannina for bringing to their attention citation [31] which helped greatly the comparative analysis performed in this work. This work was completed as part of the first author PhD thesis and falls within the research activities of the Laboratory of Statistics and Data Analysis of the University of the Aegean.

Conflicts of Interest:
The authors declare no conflict of interest.
The mapping p : Θ → P is continuous at every point θ ∈ Θ.