A Class of Association Measures for Categorical Variables Based on Weighted Minkowski Distance

Measuring and testing association between categorical variables is one of the long-standing problems in multivariate statistics. In this paper, I define a broad class of association measures for categorical variables based on weighted Minkowski distance. The proposed framework subsumes some important measures including Cramér’s V, distance covariance, total variation distance and a slightly modified mean variance index. In addition, I establish the strong consistency of the defined measures for testing independence in two-way contingency tables, and derive the scaled forms of unweighted measures.


Introduction
Measuring and testing the association between categorical variables from observed data is one of the long-standing problems in multivariate statistics. The observed frequencies of two categorical variables are often displayed in a two-way contingency table, and a multinomial distribution can be used to model the cell counts. To be specific, let X and Y be two categorical random variables with finite sampling spaces X and Y (|X| < ∞, |Y| < ∞, where |·| stands for the cardinality of a set); a simple random sample of size N can then be summarized in a |X| × |Y| table with count N_xy in cell (x, y). Let f(x, y), f(x), and f(y) be the joint and marginal probabilities of X and Y, i.e., f(x, y) = P(X = x, Y = y), f(x) = P(X = x), f(y) = P(Y = y). Statistical independence between X and Y can then be defined as f(x, y) = f(x)f(y) for any (x, y) ∈ X × Y, i.e., all joint probabilities equal the product of their marginal probabilities. Pearson's chi-squared statistic,

X² = N ∑_{(x,y)∈X×Y} (f_N(x, y) − f_N(x) f_N(y))² / (f_N(x) f_N(y)),

where f_N(x, y) = N_xy/N, f_N(x) = ∑_{y∈Y} N_xy/N, and f_N(y) = ∑_{x∈X} N_xy/N, has been widely used to test independence in two-way contingency tables. Under independence and sufficient sample size, X² approximately follows a chi-squared distribution with df = (|X| − 1)(|Y| − 1). However, for insufficient sample size (e.g., min_{x,y} N_x+ N_+y /N < 5, where N_x+ = ∑_{y∈Y} N_xy and N_+y = ∑_{x∈X} N_xy), the chi-squared test tends to be conservative. Zhang (2019) suggested a random permutation test based on a statistic D² derived from the squared distance covariance, a measure of the dependence between two random vectors of any type (discrete or continuous) [1,2]. The D² statistic is closely related to Pearson's chi-squared statistic, both measuring the squared distance between f(x, y) and f(x)f(y), (x, y) ∈ X × Y.
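As a concrete illustration, the chi-squared statistic above can be computed directly from a table of counts. The 2 × 2 table below is made up for illustration and is not from the paper:

```python
# Compute Pearson's chi-squared statistic X^2 for a two-way table.
import numpy as np

counts = np.array([[20, 5],
                   [10, 15]])            # illustrative cell counts N_xy
N = counts.sum()

f_xy = counts / N                        # f_N(x, y)
f_x = f_xy.sum(axis=1, keepdims=True)    # f_N(x)
f_y = f_xy.sum(axis=0, keepdims=True)    # f_N(y)

# X^2 = N * sum_(x,y) (f_N(x,y) - f_N(x) f_N(y))^2 / (f_N(x) f_N(y))
X2 = N * ((f_xy - f_x * f_y) ** 2 / (f_x * f_y)).sum()
df = (counts.shape[0] - 1) * (counts.shape[1] - 1)

# Compare with the 0.95 quantile of chi-squared with df = 1 (about 3.841)
reject = X2 > 3.841
```

Here X² ≈ 8.33 with df = 1, so independence would be rejected at the 0.05 level for this toy table.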
Although the distance covariance test has better empirical performance than Pearson's chi-squared test, especially for small sample sizes, its theoretical properties have not been investigated. In addition, Zhang (2019) studied only two alternative measures, distance covariance and projection correlation, leaving many other association measures in the literature unexplored.
To name a few, Goodman and Kruskal (1954) introduced two association measures for categorical variables, namely the concentration coefficient and the λ coefficient [3]. Cui et al. (2015) developed a generic association measure based on a mean-variance index [4]. Theil (1970) proposed measuring the association between two categorical variables by the uncertainty coefficient [5]. McCane and Albert (2008) introduced the symbolic covariance, which expresses the covariance between categorical variables in terms of symbolic expressions [6]. In addition, Reshef et al. (2011) proposed a pairwise dependence measure called maximal information coefficient (MIC) based on the grid that maximizes the mutual information gained from the data [7].
The purpose of this paper is to extend my previous work [1] to a broad class of association measures based on a general weighted Minkowski distance, and to numerically evaluate selected measures from the proposed class. The proposed class unifies many existing measures, including the φ coefficient, Cramér's V, distance covariance, total variation distance and a slightly modified mean variance index. Furthermore, the strong consistency of the independence tests based on these measures is established, and the scaled forms of the unweighted measures are derived. The proposed class provides a rich set of alternatives to the prevailing chi-squared statistic, and it has many potential applications. For instance, it can be applied to correlation-based modeling, such as correlation-based deep learning [8]. As pointed out by a reviewer, the proposed method may also be applied to pseudorandom number generator testing, and may improve some existing chi-squared based tests, including the poker test and the gap test [9].
The remainder of this paper is structured as follows: In Section 2, I introduce the defined class of association measures, and study some important special cases. The scaled forms of unweighted measures are also derived. Section 3 compares the performance of selected measures using simulated data. Section 4 discusses some extensions including the application to ordinal data and conditional independence test for three-way tables.

A Class of Association Measures for Categorical Variables
As the strength of association between two categorical variables can be reflected by the distance between f(x, y) and f(x)f(y), here I define a class of measures based on the weighted Minkowski distance

L_{r,ω}(X, Y) = ( ∑_{(x,y)∈X×Y} ω(x, y) |f(x, y) − f(x)f(y)|^r )^{1/r},

where r ≥ 1, ω(x, y) > 0, and ω(x, y) only depends on the marginal distributions of X and Y. For 0 < r < 1, the defined distance violates the triangle inequality and is therefore not a metric. However, r = ∞ is allowed, and I denote by L_{∞,ω}(X, Y) = max_{(x,y)} ω(x, y)|f(x, y) − f(x)f(y)| the maximum norm. It can be proved that, for a given weight ω(x, y), L_{r,ω}(X, Y) = 0 if and only if X and Y are independent. Throughout this paper, I denote by L_r(X, Y) the unweighted measures, i.e., ω(x, y) = 1.

The defined class is quite broad, and I begin with some important special cases. First, most chi-squared-type measures belong to the defined class. For instance, the φ coefficient for 2 × 2 tables (|X| = |Y| = 2) is L_{2,ω}(X, Y) with ω(x, y) = 1/(f(x)f(y)), and Cramér's V is L_{2,ω}(X, Y) with ω(x, y) = [min(|X| − 1, |Y| − 1) f(x)f(y)]^{−1}.

Distance covariance for categorical variables also belongs to the defined class. Distance covariance is a measure of statistical dependence between two random vectors X and Y, and a special case of the Hilbert-Schmidt independence criterion (HSIC) [12]. Let (X1, Y1), (X2, Y2) and (X3, Y3) be three independent copies of (X, Y); the distance covariance between X and Y is defined as the square root of

dCov²(X, Y) = E‖X1 − X2‖‖Y1 − Y2‖ + E‖X1 − X2‖ E‖Y1 − Y2‖ − 2 E‖X1 − X2‖‖Y1 − Y3‖,

where ‖·‖ represents a distance between vectors, e.g., the Euclidean distance. An alternative definition of distance covariance is given in Sejdinovic et al. (2013) [12], which only uses two independent copies of (X, Y). A proof of the equivalence between the two definitions is provided in Appendix A.1. One property of distance covariance is that dCov²(X, Y) = 0 if and only if X and Y are statistically independent, indicating its potential for measuring nonlinear dependence. Zhang (2019) studied the distance covariance for categorical variables under the multinomial model. Define the 0/1 distance d(x, x') = 1{x ≠ x'}; under this distance, it is easy to see that dCov(X, Y) = L_2(X, Y), i.e., distance covariance is the unweighted measure with r = 2. Another special case is total variation distance, which is defined as the largest difference between two probability measures [13].
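The defined measure is straightforward to compute from a joint probability table. The sketch below (function names are mine, not the paper's) implements L_{r,ω} and numerically checks that the unweighted L_2 coincides with the categorical distance covariance computed from the three-copy definition with the 0/1 distance, for an illustrative 2 × 2 table:

```python
import numpy as np

def L(f_xy, r=2.0, omega=None):
    """L_{r,omega}(X, Y) = (sum_xy omega(x,y) |f(x,y) - f(x)f(y)|^r)^(1/r)."""
    f_xy = np.asarray(f_xy, dtype=float)
    f_x = f_xy.sum(axis=1, keepdims=True)
    f_y = f_xy.sum(axis=0, keepdims=True)
    dev = np.abs(f_xy - f_x * f_y)
    w = np.ones_like(f_xy) if omega is None else np.asarray(omega, float)
    if np.isinf(r):                        # maximum norm L_{inf, omega}
        return (w * dev).max()
    return ((w * dev ** r).sum()) ** (1.0 / r)

# Illustrative joint table (not from the paper)
f = np.array([[0.30, 0.10],
              [0.15, 0.45]])
fx, fy = f.sum(axis=1), f.sum(axis=0)

# Three-copy definition with the 0/1 distance d(x, x') = 1{x != x'}
a = 1 - (fx ** 2).sum()                    # E d(X1, X2)
b = 1 - (fy ** 2).sum()                    # E d(Y1, Y2)
t1 = 1 - (fx ** 2).sum() - (fy ** 2).sum() + (f ** 2).sum()              # E d(X1,X2) d(Y1,Y2)
t3 = 1 - (fx ** 2).sum() - (fy ** 2).sum() + (f * np.outer(fx, fy)).sum()  # E d(X1,X2) d(Y1,Y3)
dcov = np.sqrt(t1 + a * b - 2 * t3)
assert np.isclose(dcov, L(f, r=2))         # dCov(X, Y) = L_2(X, Y)
```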
Let µ_0(·) and µ_α(·) be the probability measures under independence and dependence, respectively; the total variation distance between µ_0 and µ_α,

TV(µ_α, µ_0) = sup_{A⊆X×Y} |µ_α(A) − µ_0(A)|,

can be used to measure the dependence between the variables X and Y. In the case of discrete sampling spaces,

TV(µ_α, µ_0) = (1/2) ∑_{(x,y)∈X×Y} |f(x, y) − f(x)f(y)| = (1/2) L_1(X, Y),

i.e., half the unweighted L_1 measure. In addition, I point out that the mean variance index (MV) recently developed by Cui et al. [4] also belongs to the defined class, subject to some slight modifications. The slightly modified MV of Y given X can be written as

MV(Y|X) = ∑_{(x,y)∈X×Y} (f(x, y) − f(x)f(y))² / f(x),

which is the square of L_{2,ω}(X, Y) with the directional weight ω(x, y) = 1/f(x). Similar to the MV index, the symmetric versions of some other directional association measures (e.g., the concentration coefficient [3]) are also special cases of L_{r,ω}.
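On a finite space, the identity TV = L_1/2 can be checked by brute force, taking the supremum over all subsets of cells; the joint table below is illustrative, not from the paper:

```python
import numpy as np
from itertools import combinations

f = np.array([[0.30, 0.10],
              [0.15, 0.45]])                       # illustrative joint table
indep = np.outer(f.sum(axis=1), f.sum(axis=0))     # f(x) f(y)
dev = f - indep

# TV(mu_alpha, mu_0) = sup_A |mu_alpha(A) - mu_0(A)| over all cell sets A
cells = [(i, j) for i in range(2) for j in range(2)]
tv = max(abs(sum(dev[c] for c in subset))
         for k in range(len(cells) + 1)
         for subset in combinations(cells, k))

L1 = np.abs(dev).sum()                             # unweighted L_1(X, Y)
assert np.isclose(tv, 0.5 * L1)
```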

Sample Estimate and Independence Test
Given a simple random sample of size N, one can estimate L_{r,ω}(X, Y) using sample quantities:

L_{r,ω,N}(X, Y) = ( ∑_{(x,y)∈X×Y} ω_N(x, y) |f_N(x, y) − f_N(x) f_N(y)|^r )^{1/r},

where f_N(x, y), f_N(x) and f_N(y) represent the maximum likelihood estimates of the joint and marginal probabilities, respectively, i.e., f_N(x, y) = N_xy/N, f_N(x) = ∑_{y∈Y} N_xy/N, and f_N(y) = ∑_{x∈X} N_xy/N, and ω_N(x, y) is the weight evaluated at the estimated marginals. The following theorem establishes the strong consistency of the independence test based on L_{r,ω,N}(X, Y) (a detailed proof is provided in Appendix A.4):

Theorem 1. Assume that the estimated weights are bounded above by a constant C > 0, i.e., sup_{x,y} ω_N(x, y) ≤ C. Then for any r ≥ 1 and ε > 0, we have

P(|L_{r,ω,N}(X, Y) − L_{r,ω}(X, Y)| ≥ ε) ≤ (2^{|X||Y|} + 2^{|X|} + 2^{|Y|}) exp(−Nε²/(18C²)).

The inequality also holds for the maximum norm L_{∞,ω,N}(X, Y).
It is noteworthy that the asymptotic null distribution of L_{r,ω,N}(X, Y) is impractical to derive. The theorem above provides a simple way to compute an upper bound on the p-value; however, the bound (2^{|X||Y|} + 2^{|X|} + 2^{|Y|}) exp(−Nε²/(18C²)) is generally not tight, so the p-value could be largely overestimated. Here, I suggest a simple permutation procedure to evaluate the significance. One can randomly shuffle the observations of X (or, equivalently, the observations of Y) M times, and compute the test statistic L_{r,ω,N}(X_perm, Y) for each permuted dataset. The permutation p-value can be computed as the proportion of the L_{r,ω,N}(X_perm, Y) values that exceed the actually observed statistic. I used the permutation p-value to evaluate statistical significance in the simulation studies.
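The permutation procedure above can be sketched as follows; this is a minimal illustration using the unweighted L_2 statistic, with function names and toy data of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

def L2_stat(x, y):
    """Unweighted L_2 statistic from paired categorical observations."""
    xs, ys = np.unique(x), np.unique(y)
    f_xy = np.array([[np.mean((x == a) & (y == b)) for b in ys] for a in xs])
    f_x = f_xy.sum(axis=1, keepdims=True)
    f_y = f_xy.sum(axis=0, keepdims=True)
    return np.sqrt(((f_xy - f_x * f_y) ** 2).sum())

def perm_pvalue(x, y, M=999):
    """Proportion of permuted statistics that exceed the observed one."""
    obs = L2_stat(x, y)
    exceed = sum(L2_stat(rng.permutation(x), y) >= obs for _ in range(M))
    return exceed / M

# Toy dependent data: Y follows X about 80% of the time (3 categories each)
x = rng.integers(0, 3, size=100)
y = (x + (rng.random(100) < 0.2)) % 3
p = perm_pvalue(x, y, M=199)             # a small p-value is expected here
```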

Scaled Forms of Unweighted Measures
Motivated by the classic correlation coefficient, I define the following scaled form for the unweighted measure L_r(X, Y):

L*_r(X, Y) = L_r(X, Y) / √(L_r(X, X) L_r(Y, Y)),

where L_r(X, X) is the measure applied to the pair (X, X), whose joint distribution satisfies f(x, x) = f(x) and f(x, x') = 0 for x ≠ x'. The term L_r(X, X) can be written as

L_r(X, X) = ( ∑_x (f(x)(1 − f(x)))^r + ∑_{x≠x'} (f(x)f(x'))^r )^{1/r},

and, as examples, the explicit expressions for L_1(X, X), L_2(X, X), and L_∞(X, X) are given below:

L_1(X, X) = 2(1 − ∑_x f(x)²),
L_2(X, X) = ( ∑_x f(x)²(1 − f(x))² + ∑_{x≠x'} f(x)² f(x')² )^{1/2},
L_∞(X, X) = max_x f(x)(1 − f(x)).

It can be seen that L*_2(X, Y) is the same as the distance correlation between X and Y [1]; therefore 0 ≤ L*_2(X, Y) ≤ 1, and L*_2(X, Y) = 0 if and only if X and Y are independent. In fact, for any r ≥ 1, L*_r(X, Y) = 0 if and only if X and Y are independent, and L*_r(X, Y) = 1 if and only if X and Y have perfect association, i.e., |X| = |Y| and for any x ∈ X there exists a unique y ∈ Y such that f(x, y) = f(x) = f(y).
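The scaled form and its two boundary cases can be verified numerically; the helper names below are my own, and the tables are illustrative:

```python
import numpy as np

def L_r(f_xy, r=2.0):
    """Unweighted L_r(X, Y) from a joint probability table."""
    f_xy = np.asarray(f_xy, float)
    dev = np.abs(f_xy - f_xy.sum(1, keepdims=True) * f_xy.sum(0, keepdims=True))
    return dev.max() if np.isinf(r) else (dev ** r).sum() ** (1.0 / r)

def self_table(p):
    """Joint table of the pair (X, X): f(x, x) = f(x), zero off-diagonal."""
    return np.diag(np.asarray(p, float))

def L_star(f_xy, r=2.0):
    """Scaled form L*_r = L_r(X,Y) / sqrt(L_r(X,X) L_r(Y,Y))."""
    f_xy = np.asarray(f_xy, float)
    fx, fy = f_xy.sum(axis=1), f_xy.sum(axis=0)
    return L_r(f_xy, r) / np.sqrt(L_r(self_table(fx), r) * L_r(self_table(fy), r))

# Perfect association (a diagonal table) gives L*_r = 1 ...
perfect = np.diag([0.2, 0.3, 0.5])
# ... and an independence table gives L*_r = 0.
indep = np.outer([0.4, 0.6], [0.5, 0.5])
```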
For L*_∞(X, Y), by the Cauchy-Schwarz inequality,

|f(x, y) − f(x)f(y)| = |Cov(1{X = x}, 1{Y = y})| ≤ √(f(x)(1 − f(x)) f(y)(1 − f(y))) ≤ √(L_∞(X, X) L_∞(Y, Y)),

therefore 0 ≤ L*_∞(X, Y) ≤ 1. However, in general, L*_∞(X, Y) = 1 does not imply that X and Y are perfectly associated; Table 1 gives an example where L*_∞(X, Y) = 1 but X and Y are not perfectly associated. Table 1. An example in which X and Y are not perfectly associated, but L*_∞(X, Y) = 1.

Numerical Study
Two simulation studies were conducted to compare the performance of selected measures from the defined class. In both simulations, I set |X| = |Y| = 10 and varied the sample size from 25 to 500, so that the simulated contingency tables were relatively large and sparse (the average cell count N/(|X||Y|) ranges from 0.25 to 5).
In the first simulation study, I considered the independence test based on different unweighted measures, including L_1, L_2, L_4 and L_∞, under four multinomial settings: in settings 1 and 2, the deviation from independence is spread across many cells, while in settings 3 and 4 a single cell accounts for most of the deviation. Figure 1 summarizes the empirical statistical power of the four tests at significance level 0.05. It can be seen that, in settings 1 and 2, the L_2 measure (Euclidean distance) performed consistently better than the other three (with L_4 comparable), while the maximum norm L_∞ performed the worst. In settings 3 and 4, the maximum norm performed the best, while the L_1 measure (Manhattan distance) gave the lowest power. Figure 2 summarizes the type I error rates, where it can be seen that all four tests control the type I error rate at the nominal level of 0.05. In the second simulation study, I focused on L_{2,ω}(X, Y), as it subsumes many popular measures. In particular, I compared three different weight functions: ω(x, y) = 1 (distance covariance), ω(x, y) = 1/(f(x)f(y)) (Cramér's V), and ω(x, y) = 1/f(x) + 1/f(y) (modified mean variance index). Figure 3 shows the empirical statistical power of the three measures under settings 1 and 2, where it can be seen that the unweighted L_2 compares favorably with the weighted ones.
Based on the simulation studies, I recommend the unweighted L_r measures with a moderate choice of r (for instance, r = 2, 3, 4) for large sparse tables, because they give satisfactory and stable statistical power in general scenarios. The maximum norm L_∞ is not recommended, unless one is very confident that a very small number of cells account for most of the deviation from independence.

Discussion
In this work, I proposed a rich class of dependence measures for categorical variables based on the weighted Minkowski distance. The defined class unifies a number of existing measures, including Cramér's V, distance covariance, total variation distance and a slightly modified mean variance index. I provided the scaled forms of the unweighted measures, which range from 0 (independence) to 1 (perfect association). Further, I established the strong consistency of the defined measures and suggested a simple permutation test for evaluating significance. Although I have used nominal and univariate categorical variables for illustration, the proposed framework can be extended to other data types and problems. First, the proposed measures can be used to detect ordinal association by assigning proper weights. Similar to Pearson's correlation coefficient, one may assign larger weights to more extreme categories of X and Y. To be specific, let d(x, x') be a predefined distance between categories X = x and X = x', and d(y, y') the distance between Y = y and Y = y'; one could then construct a weight function ω(x, y) from these distances that assigns larger weights to cells in the corners of the table and smaller weights to cells in the center.
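As an illustration only — this particular weight is my own construction, not one proposed in the paper — one could take ω(x, y) to be the product of each category's average distance to the rest of its marginal distribution, which is largest in the corners of an ordered table:

```python
import numpy as np

def ordinal_weight(f_xy, dx=None, dy=None):
    """Illustrative weight: omega(x,y) = (sum_x' d(x,x') f(x')) (sum_y' d(y,y') f(y'))."""
    f_xy = np.asarray(f_xy, float)
    fx, fy = f_xy.sum(axis=1), f_xy.sum(axis=0)
    nx, ny = f_xy.shape
    if dx is None:  # default distance: absolute difference of category indices
        dx = np.abs(np.subtract.outer(np.arange(nx), np.arange(nx)))
    if dy is None:
        dy = np.abs(np.subtract.outer(np.arange(ny), np.arange(ny)))
    wx = dx @ fx                  # average distance of each x to the marginal of X
    wy = dy @ fy                  # average distance of each y to the marginal of Y
    return np.outer(wx, wy)

# For a uniform 5 x 5 table, corner cells get the largest weight
# and the central cell the smallest.
w = ordinal_weight(np.full((5, 5), 0.04))
assert np.isclose(w[0, 0], w.max()) and np.isclose(w[2, 2], w.min())
```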
Second, my framework can be generalized to random vectors and multi-way tables. In the case of a three-way table (X, Y, Z), one can define the following Minkowski distance between f(x, y, z) and f(x, y)f(z):

L_{r,ω}((X, Y), Z) = ( ∑_{(x,y,z)} ω(x, y, z) |f(x, y, z) − f(x, y) f(z)|^r )^{1/r},

which can be used to test the joint independence between (X, Y) and Z, or equivalently, to test the homogeneity of the joint distribution of (X, Y) at different levels of Z. A similar permutation procedure can be applied to evaluate the statistical significance. One can also define the distance between f(x, y, z) and f(x)f(y)f(z) to test the mutual independence of X, Y and Z. Furthermore, the framework can be extended to the conditional independence test in three-way tables [14], by defining a distance between the conditional joint probabilities f(x, y|z) and the product of conditional marginal probabilities f(x|z)f(y|z).

Author Contributions: Q.Z. conceived of the presented idea, developed the theory, performed the computations and wrote the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The author would like to thank the editor and two reviewers for their thoughtful comments and efforts towards improving the manuscript.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: dCov (distance covariance), HSIC (Hilbert-Schmidt independence criterion), MIC (maximal information coefficient), MV (mean variance index).

Appendix A.1. Equivalence of the Two Definitions of Distance Covariance

The first two terms of the two definitions are the same, and the equivalence of the third terms can be shown as follows:

E‖X1 − X2‖‖Y1 − Y3‖ = ∫∫∫∫ ‖x1 − x2‖ ‖y1 − y3‖ f(x1, y1) f(x2) f(y3) dx1 dx2 dy1 dy3 = E[ E(‖X1 − X2‖ | X1) E(‖Y1 − Y2‖ | Y1) ],

which is the third term in the definition of Sejdinovic et al. (2013) based on two independent copies of (X, Y).