Abstract
Measuring and testing association between categorical variables is one of the long-standing problems in multivariate statistics. In this paper, I define a broad class of association measures for categorical variables based on weighted Minkowski distance. The proposed framework subsumes some important measures including Cramér’s V, distance covariance, total variation distance and a slightly modified mean variance index. In addition, I establish the strong consistency of the defined measures for testing independence in two-way contingency tables, and derive the scaled forms of unweighted measures.
1. Introduction
Measuring and testing the association between categorical variables from observed data is one of the long-standing problems in multivariate statistics. The observed frequencies of two categorical variables are often displayed in a two-way contingency table, and a multinomial distribution can be used to model the cell counts. To be specific, let X and Y be two categorical random variables with finite sampling spaces $\mathcal{X}$ and $\mathcal{Y}$ ($|\mathcal{X}| = I$ and $|\mathcal{Y}| = J$, where $|\cdot|$ stands for the cardinality of a set), and a simple random sample of size N can be summarized in an $I \times J$ table with count $n_{ij}$ in cell $(i,j)$. Let $p_{ij}$, $p_{i\cdot}$, and $p_{\cdot j}$ be the joint and marginal probabilities of X and Y, i.e., $p_{ij} = P(X = x_i, Y = y_j)$, $p_{i\cdot} = P(X = x_i)$ and $p_{\cdot j} = P(Y = y_j)$; then the statistical independence between X and Y can be defined as $p_{ij} = p_{i\cdot} p_{\cdot j}$ for any $(i,j)$, i.e., all joint probabilities equal the product of their marginal probabilities. Pearson's chi-squared statistic,
$$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}},$$
where $\hat{\mu}_{ij} = N\hat{p}_{i\cdot}\hat{p}_{\cdot j}$, $\hat{p}_{i\cdot} = \sum_{j=1}^{J} n_{ij}/N$, and $\hat{p}_{\cdot j} = \sum_{i=1}^{I} n_{ij}/N$, has been widely used to test independence in two-way contingency tables. Under independence and sufficient sample size, $\chi^2$ approximately follows a chi-squared distribution with $(I-1)(J-1)$ degrees of freedom. However, for insufficient sample size (e.g., $N/(IJ) < 5$, where $IJ$ is the number of cells), the chi-squared test tends to be conservative. Zhang (2019) suggested a random permutation test based on the test statistic
$$T = \sum_{i=1}^{I}\sum_{j=1}^{J}\big(\hat{p}_{ij} - \hat{p}_{i\cdot}\hat{p}_{\cdot j}\big)^2,$$
which is derived from the squared distance covariance, a measure of the dependence between two random vectors of any type (discrete or continuous) [1,2]. The statistic T is closely related to Pearson's chi-squared statistic, both measuring the squared distance between $\hat{p}_{ij}$ and $\hat{p}_{i\cdot}\hat{p}_{\cdot j}$, $i = 1, \dots, I$, $j = 1, \dots, J$. In the numerical study of Zhang (2019), the distance covariance test was evaluated in terms of statistical power and type I error rate under various settings (see Figures 1–3 in [1]). It was found that for relatively large sample sizes, the distance covariance test performs similarly to Pearson's chi-squared test. However, for relatively small sample sizes, the distance covariance test is substantially more powerful, and it controls the type I error rate at the nominal level. For small samples, Pearson's chi-squared test exhibits substantial conservativeness, in the sense that the type I error rate is much lower than the nominal level and it fails to reject many false hypotheses. For instance, in a simulation setting with a 20 × 20 table and only 50 observations, the statistical power and type I error rate of Pearson's chi-squared test are both close to zero, indicating extreme conservativeness.
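As a concrete illustration of the two statistics discussed above, the following minimal sketch computes Pearson's chi-squared statistic and the squared-distance statistic T from a table of counts (assuming all marginal counts are positive; the 2 × 2 table is an arbitrary example, not taken from the paper):

```python
import numpy as np

def chi2_and_T(counts):
    """Pearson's chi-squared and the squared-distance statistic T from a count table."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts / N                         # joint estimates p_ij
    prow = p.sum(axis=1, keepdims=True)    # marginal estimates p_i.
    pcol = p.sum(axis=0, keepdims=True)    # marginal estimates p_.j
    indep = prow * pcol                    # joint probabilities under independence
    chi2 = N * ((p - indep) ** 2 / indep).sum()
    T = ((p - indep) ** 2).sum()           # statistic derived from squared distance covariance
    return chi2, T

print(chi2_and_T(np.array([[20, 5], [8, 17]])))
```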
Although the distance covariance test has better empirical performance than Pearson's chi-squared test, especially for small sample sizes, its theoretical properties have not been investigated. In addition, Zhang (2019) only studied two alternative measures, distance covariance and projection correlation, but many other association measures in the literature remain unexplored. To name a few, Goodman and Kruskal (1954) introduced two association measures for categorical variables, namely the concentration coefficient $\tau$ and the $\lambda$ coefficient [3]. Cui et al. (2015) developed a generic association measure based on a mean-variance index [4]. Theil (1970) proposed measuring the association between two categorical variables by the uncertainty coefficient [5]. McCane and Albert (2008) introduced the symbolic covariance, which expresses the covariance between categorical variables in terms of symbolic expressions [6]. In addition, Reshef et al. (2011) proposed a pairwise dependence measure called the maximal information coefficient (MIC), based on the grid that maximizes the mutual information gained from the data [7].
The purpose of this paper is to extend my previous work [1] to a broad class of association measures using a general weighted Minkowski distance, and to numerically evaluate some selected measures from the proposed class. The proposed class unifies many existing measures, including the $\phi$ coefficient, Cramér's V, distance covariance, total variation distance and a slightly modified mean variance index. Furthermore, I establish the strong consistency of the independence tests based on these measures, and derive the scaled forms of the unweighted measures. The proposed class provides a rich set of alternatives to the prevailing chi-squared statistic, and it has many potential applications. For instance, it can be applied to correlation-based modeling, such as correlation-based deep learning [8]. As enlightened by a reviewer, the proposed method may also be applied to pseudorandom number generator tests, and may improve some existing chi-squared based tests, including the poker test and the gap test [9].
The remainder of this paper is structured as follows: In Section 2, I introduce the defined class of association measures and study some important special cases; the scaled forms of the unweighted measures are also derived. Section 3 compares the performance of selected measures using simulated data. Section 4 discusses some extensions, including the application to ordinal data and the conditional independence test for three-way tables.
2. Methods
2.1. A Class of Association Measures for Categorical Variables
As the strength of association between two categorical variables can be reflected by the distance between $p_{ij}$ and $p_{i\cdot}p_{\cdot j}$, here I define a class of measures based on the weighted Minkowski distance
$$D_r^w(X,Y) = \left(\sum_{i=1}^{I}\sum_{j=1}^{J} w_{ij}\,\big|p_{ij} - p_{i\cdot}p_{\cdot j}\big|^r\right)^{1/r}, \tag{1}$$
where $r \geq 1$, $w_{ij} > 0$, and the weight $w_{ij}$ only depends on the marginal distributions of X and Y. For $0 < r < 1$, the defined distance violates the triangle inequality, therefore it is not a metric. However, $r = \infty$ is allowed, and I denote by $D_\infty(X,Y) = \max_{i,j}|p_{ij} - p_{i\cdot}p_{\cdot j}|$ the maximum norm. It can be proved that $\lim_{r \to \infty} D_r^w(X,Y) = D_\infty(X,Y)$ for any given weight. Throughout this paper, I denote by $D_r$ the unweighted measures, i.e., $w_{ij} \equiv 1$. The defined class is quite broad, and I begin with some important special cases.
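A minimal sketch of the defined class (an illustration under the notation above, not the author's code): the function below evaluates $D_r^w$ for a joint probability table, with the weight supplied as a function of the two marginals and $r = \infty$ handled as the maximum norm.

```python
import numpy as np

def D_rw(p, r=2.0, weight=None):
    """Weighted Minkowski measure D_r^w between p_ij and p_i. * p_.j.

    p: I x J joint probability table; r: order (np.inf gives the maximum norm);
    weight: callable mapping (prow, pcol) of shapes (I,1), (1,J) to an I x J array.
    """
    p = np.asarray(p, dtype=float)
    prow = p.sum(axis=1, keepdims=True)          # p_i.
    pcol = p.sum(axis=0, keepdims=True)          # p_.j
    dev = np.abs(p - prow * pcol)                # |p_ij - p_i. p_.j|
    w = np.ones_like(p) if weight is None else weight(prow, pcol)
    if np.isinf(r):
        return (w * dev).max()                   # maximum norm
    return ((w * dev ** r).sum()) ** (1.0 / r)
```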
Firstly, most of the chi-squared-type measures belong to the defined class. For instance, the $\phi$ coefficient for $2 \times 2$ tables, i.e., $I = J = 2$,
$$\phi = \left(\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}}\right)^{1/2},$$
is a special case of $D_2^w$, where $w_{ij} = (p_{i\cdot}p_{\cdot j})^{-1}$. Extensions of $\phi$ to $I \times J$ tables, including Cramér's V and Tschuprow's T [10,11],
$$V = \left(\frac{1}{\min(I-1, J-1)}\sum_{i,j}\frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}}\right)^{1/2}, \qquad T = \left(\frac{1}{\sqrt{(I-1)(J-1)}}\sum_{i,j}\frac{(p_{ij} - p_{i\cdot}p_{\cdot j})^2}{p_{i\cdot}p_{\cdot j}}\right)^{1/2},$$
are also special cases of $D_2^w$, where $w_{ij} = \big[p_{i\cdot}p_{\cdot j}\min(I-1, J-1)\big]^{-1}$ for Cramér's V, and $w_{ij} = \big[p_{i\cdot}p_{\cdot j}\sqrt{(I-1)(J-1)}\big]^{-1}$ for Tschuprow's T.
Distance covariance for categorical variables also belongs to the defined class. Distance covariance is a measure of statistical dependence between two random vectors X and Y; it is a special case of the Hilbert-Schmidt independence criterion (HSIC) [12]. Let $(X,Y)$, $(X',Y')$ and $(X'',Y'')$ be three independent copies of $(X,Y)$; the distance covariance between X and Y is defined as the square root of
$$\mathrm{dCov}^2(X,Y) = E\big[d(X,X')\,d(Y,Y')\big] + E\big[d(X,X')\big]E\big[d(Y,Y')\big] - 2E\big[d(X,X')\,d(Y,Y'')\big], \tag{2}$$
where $d(\cdot,\cdot)$ represents a distance between vectors, e.g., the Euclidean distance. An alternative definition of distance covariance is given in Sejdinovic et al. (2013) [12], which only uses two independent copies of $(X,Y)$. A proof of the equivalency between the two definitions is provided in Appendix A.1. One property of distance covariance is that $\mathrm{dCov}(X,Y) = 0$ if and only if X and Y are statistically independent, indicating its potential for measuring nonlinear dependence. Zhang (2019) studied the distance covariance for categorical variables under the multinomial model. Defining $d(x, x') = 0$ if $x = x'$ and 1 otherwise, one can show that
$$\mathrm{dCov}^2(X,Y) = \sum_{i=1}^{I}\sum_{j=1}^{J}\big(p_{ij} - p_{i\cdot}p_{\cdot j}\big)^2, \tag{3}$$
and it is easy to see that $\mathrm{dCov}(X,Y) = D_2(X,Y)$. A detailed proof of Equation (3) is provided in Appendix A.2.
Another special case is the total variation distance, which is defined as the largest difference between the probabilities that two probability measures assign to the same event [13]. Let $Q$ and $P$ be the measures under independence and dependence, respectively; the total variation distance between $P$ and $Q$ can be used to measure the dependence between the variables X and Y:
$$\mathrm{TV}(P, Q) = \sup_{A}\,\big|P(A) - Q(A)\big|.$$
In the case of discrete sampling spaces, let $p_{ij} = P(X = x_i, Y = y_j)$ and $q_{ij} = p_{i\cdot}p_{\cdot j}$; then we have
$$\mathrm{TV}(P, Q) = \frac{1}{2}\sum_{i=1}^{I}\sum_{j=1}^{J}\big|p_{ij} - q_{ij}\big| = \frac{1}{2}\sum_{i,j}\big|p_{ij} - p_{i\cdot}p_{\cdot j}\big|,$$
therefore $\mathrm{TV}(P, Q) = D_1^w(X,Y)$, where $w_{ij} = 1/2$.
In addition, I point out that the mean variance index (MV) recently developed by Cui et al. [4] also belongs to the defined class, subject to some slight modifications. The MV between two variables X and Y is defined as $\mathrm{MV}(Y|X) = E_Y\big[\mathrm{Var}_X\big(F(Y|X)\big)\big]$, where $F(y|x) = P(Y \leq y \mid X = x)$ stands for the conditional distribution function. It can be proved that $\mathrm{MV}(Y|X) = 0$ if and only if X and Y are independent. The MV measure was originally developed for continuous variables. To make it suitable for categorical variables while maintaining the main theoretical property, I slightly modified the definition of MV. First, I replaced the conditional c.d.f. $F(y|x)$ with the conditional p.m.f. $p(y|x) = P(Y = y \mid X = x)$. Second, as the MV measure is generally asymmetric, i.e., $\mathrm{MV}(Y|X) \neq \mathrm{MV}(X|Y)$, I considered a symmetric version of the index, $\mathrm{MVS}(X,Y) = \mathrm{MV}(Y|X) + \mathrm{MV}(X|Y)$. With the two modifications, one can prove the following result (a detailed proof is provided in Appendix A.3):
$$\mathrm{MVS}(X,Y) = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(\frac{p_{\cdot j}}{p_{i\cdot}} + \frac{p_{i\cdot}}{p_{\cdot j}}\right)\big(p_{ij} - p_{i\cdot}p_{\cdot j}\big)^2,$$
therefore $\mathrm{MVS}(X,Y) = \big[D_2^w(X,Y)\big]^2$, where $w_{ij} = p_{\cdot j}/p_{i\cdot} + p_{i\cdot}/p_{\cdot j}$. As $w_{ij} \geq 2 > 0$, we also have that $\mathrm{MVS}(X,Y) = 0$ if and only if X and Y are independent.
Similar to the MV index, the symmetric versions of some other directional association measures (e.g., the concentration coefficient [3]) are also special cases of $D_r^w$.
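The special cases above differ only in the weight passed to the general measure. Reusing the `D_rw` sketch from earlier in this section, the following illustration (with an arbitrary joint distribution of my own choosing, and the MVS weight as derived in Appendix A.3) recovers each of them:

```python
import numpy as np

p = np.array([[0.20, 0.10, 0.05],
              [0.05, 0.25, 0.05],
              [0.05, 0.05, 0.20]])   # arbitrary 3 x 3 joint distribution

k = min(p.shape[0] - 1, p.shape[1] - 1)
dcov = D_rw(p, 2)                                             # distance covariance
phi  = D_rw(p, 2, weight=lambda pr, pc: 1 / (pr * pc))        # phi (mean-square contingency)
V    = D_rw(p, 2, weight=lambda pr, pc: 1 / (pr * pc * k))    # Cramer's V
tv   = D_rw(p, 1, weight=lambda pr, pc: 0.5 * np.ones((pr.size, pc.size)))  # total variation
mvs  = D_rw(p, 2, weight=lambda pr, pc: pc / pr + pr / pc) ** 2             # modified MV index
print(dcov, phi, V, tv, mvs)
```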
2.2. Sample Estimate and Independence Test
Given a simple random sample of size N, one can estimate $D_r^w(X,Y)$ using sample quantities:
$$\hat{D}_r^w(X,Y) = \left(\sum_{i=1}^{I}\sum_{j=1}^{J}\hat{w}_{ij}\,\big|\hat{p}_{ij} - \hat{p}_{i\cdot}\hat{p}_{\cdot j}\big|^r\right)^{1/r},$$
where $\hat{p}_{ij}$, $\hat{p}_{i\cdot}$ and $\hat{p}_{\cdot j}$ represent the maximum likelihood estimates of the joint and marginal probabilities, respectively, i.e., $\hat{p}_{ij} = n_{ij}/N$, $\hat{p}_{i\cdot} = \sum_{j=1}^{J} n_{ij}/N$, and $\hat{p}_{\cdot j} = \sum_{i=1}^{I} n_{ij}/N$. The following theorem establishes the strong consistency of the independence test based on $\hat{D}_r^w$ (a detailed proof is provided in Appendix A.4):
Theorem 1.
Assume that the estimated weights are bounded above by a constant $C \geq 1$, i.e., $\hat{w}_{ij} \leq C$ for all $(i,j)$; then for any $r \geq 1$ and $\epsilon > 0$, we have under independence
$$P\big(\hat{D}_r^w(X,Y) > \epsilon\big) \leq \big(2^{IJ} + 2^{I} + 2^{J}\big)\exp\left(-\frac{N\epsilon^2}{18C^2}\right),$$
and consequently $\hat{D}_r^w(X,Y) \to 0$ almost surely. The inequality also holds for the maximum norm $\hat{D}_\infty$.
It is noteworthy that the asymptotic null distribution of $\hat{D}_r^w$ is impractical to derive. The theorem above provides a simple way to compute an upper bound on the p-value; however, the bound is generally not tight, so the p-value could be largely overestimated. Here, I suggest a simple permutation procedure to evaluate the significance. One can randomly shuffle the observations of X (or, equivalently, the observations of Y) M times, and compute the test statistic $\hat{D}_r^w$ for each permuted dataset. The permutation p-value can be computed as the proportion of permuted statistics that exceed the actually observed one. I used the permutation p-value to evaluate statistical significance in the simulation studies.
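A sketch of the suggested permutation procedure (with illustrative helper names of my own; the statistic can be any member of the class, e.g., the unweighted $D_2$ via the `D_rw` sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_pvalue(x, y, stat, M=2000):
    """Permutation p-value; x, y are integer-coded label arrays, and stat maps a
    probability table to a scalar test statistic."""
    I, J = x.max() + 1, y.max() + 1
    def ptable(a, b):
        t = np.zeros((I, J))
        np.add.at(t, (a, b), 1.0)
        return t / len(a)
    observed = stat(ptable(x, y))
    exceed = 0
    for _ in range(M):
        exceed += stat(ptable(rng.permutation(x), y)) >= observed  # shuffle X only
    return exceed / M   # proportion of permuted statistics exceeding the observed one

# usage: perm_pvalue(x, y, stat=lambda p: D_rw(p, 2), M=2000)
```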
2.3. Scaled Forms of Unweighted Measures
Motivated by the classic correlation coefficient, I define the following scaled form for the unweighted measure $D_r$:
$$R_r(X,Y) = \frac{D_r(X,Y)}{\sqrt{D_r(X,X)\,D_r(Y,Y)}},$$
where $D_r(X,X) = \big(\sum_{i=1}^{I}\sum_{i'=1}^{I}|\delta_{ii'}\,p_{i\cdot} - p_{i\cdot}p_{i'\cdot}|^r\big)^{1/r}$, with $\delta_{ii'} = 1$ if $i = i'$ and $\delta_{ii'} = 0$ otherwise.
The term $D_r(X,X)$ can be written as
$$D_r(X,X) = \left(\sum_{i=1}^{I} p_{i\cdot}^{\,r}(1 - p_{i\cdot})^r + \sum_{i \neq i'}(p_{i\cdot}p_{i'\cdot})^r\right)^{1/r},$$
and, as examples, the explicit expressions for $r = 1$, $r = 2$ and $r = \infty$ are given below:
$$D_1(X,X) = 2\Big(1 - \sum_{i=1}^{I} p_{i\cdot}^2\Big),$$
$$D_2(X,X) = \left(\sum_{i=1}^{I} p_{i\cdot}^2(1 - p_{i\cdot})^2 + \Big(\sum_{i=1}^{I} p_{i\cdot}^2\Big)^2 - \sum_{i=1}^{I} p_{i\cdot}^4\right)^{1/2},$$
$$D_\infty(X,X) = \max\Big\{\max_i\, p_{i\cdot}(1 - p_{i\cdot}),\; \max_{i \neq i'}\, p_{i\cdot}p_{i'\cdot}\Big\}.$$
It can be seen that $R_2(X,Y)$ is the same as the distance correlation between X and Y [1]; therefore $0 \leq R_2 \leq 1$, where $R_2 = 0$ if and only if X and Y are independent. In fact, for any $r \in [1, \infty]$, if $w_{ij} \equiv 1$, it can be proved that $0 \leq R_r \leq 1$, where $R_r = 0$ if and only if X and Y are independent; moreover, for $r = 1$, $R_1 = 1$ if and only if X and Y have perfect association, i.e., for any $x_i \in \mathcal{X}$, there exists a unique $y_j \in \mathcal{Y}$, such that $p_{ij} = p_{i\cdot} = p_{\cdot j}$.
For $r = 2$, by the Cauchy-Schwarz inequality,
$$D_2^2(X,Y) \leq D_2(X,X)\,D_2(Y,Y),$$
therefore $R_2 \leq 1$. However, in general, $R_2 = 1$ does not imply that X and Y are perfectly associated. I give an example in Table 1, where $R_2(X,Y) = 1$ but X and Y are not perfectly associated.
Table 1.
An example in which X and Y are not perfectly associated, but $R_2(X,Y) = 1$.
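The scaled form is straightforward to compute from the self-distance expansion above. A sketch (reusing `D_rw` from Section 2.1; the final line numerically confirms the closed form for $D_1(X,X)$):

```python
import numpy as np

def D_self(marg, r=2.0):
    """Unweighted D_r(X, X) computed from the marginal distribution of X."""
    q = np.outer(marg, marg)            # p_i. * p_i'.
    dev = np.abs(np.diag(marg) - q)     # |delta_ii' p_i. - p_i. p_i'.|
    return dev.max() if np.isinf(r) else (dev ** r).sum() ** (1.0 / r)

def R_r(p, r=2.0):
    """Scaled measure R_r = D_r(X,Y) / sqrt(D_r(X,X) D_r(Y,Y))."""
    p = np.asarray(p, dtype=float)
    return D_rw(p, r) / np.sqrt(D_self(p.sum(axis=1), r) * D_self(p.sum(axis=0), r))

marg = np.array([0.2, 0.3, 0.5])
print(D_self(marg, 1), 2 * (1 - (marg ** 2).sum()))   # the two values agree
```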
3. Numerical Study
Two simulation studies were conducted to compare the performance of some selected measures from the defined class. In both simulations, I set $I = J = 10$ and varied the sample size N from 25 to 500, so that the simulated contingency tables were relatively large and sparse (the average count per cell is between 0.25 and 5).
In the first simulation study, I considered the independence tests based on different unweighted measures, including $D_1$, $D_2$, $D_3$ and $D_\infty$, under the following multinomial settings (a data-generating sketch is given after the list):
- Setting 1: an elevated joint probability assigned to 10 randomly selected cells, with a common lower probability for the remaining 90 cells
- Setting 2: as in Setting 1, with a different signal strength in the 10 selected cells
- Setting 3: an elevated joint probability assigned to one randomly selected cell, with a common lower probability for the remaining 99 cells
- Setting 4: as in Setting 3, with a different signal strength in the selected cell
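The sketch below illustrates this type of data-generating design (the cell probabilities here are placeholder values of my own choosing; the paper's exact signal strengths are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1)

def setting_probs(I=10, J=10, n_signal=10, signal_mass=0.5):
    """A 10 x 10 cell-probability table: n_signal elevated cells share signal_mass,
    the remaining cells share the rest (placeholder magnitudes)."""
    p = np.full(I * J, (1 - signal_mass) / (I * J - n_signal))
    hot = rng.choice(I * J, size=n_signal, replace=False)
    p[hot] = signal_mass / n_signal
    return p.reshape(I, J)

def sample_table(p, N):
    """Draw an I x J table of counts from the multinomial model."""
    return rng.multinomial(N, p.ravel()).reshape(p.shape)

table = sample_table(setting_probs(), N=50)   # a sparse 10 x 10 table
```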
For each test, the p-values were computed based on 2000 random permutations. Figure 1 summarizes the empirical statistical power of the four tests at significance level 0.05. It can be seen that, in Settings 1 and 2, the $D_2$ measure (Euclidean distance) performed consistently better than the other three, with $D_3$ being comparable, and the maximum norm performed the worst in these two settings. In Settings 3 and 4, where a single cell accounts for most of the deviation from independence, the maximum norm performed the best, while the $D_1$ measure (Manhattan distance) gave the lowest power. Figure 2 summarizes the type I error rate, where it can be seen that all four tests control the type I error rate at the nominal level of 0.05.
Figure 1.
Empirical statistical power of four different measures, including $D_1$ (blue), $D_2$ (red), $D_3$ (black) and $D_\infty$ (green), under Settings 1–4. In each setting, the sample sizes range from 25 to 500, and all results are based on 1000 replications.
Figure 2.
Empirical type I error rate of four different measures, including $D_1$ (blue), $D_2$ (red), $D_3$ (black) and $D_\infty$ (green), under Settings 1–4. In each setting, the sample sizes range from 25 to 500, and all results are based on 1000 replications.
In the second simulation study, I focused on $D_2$, as it subsumes many popular measures. In particular, I compared three different weight functions, namely $w_{ij} = 1$ (distance covariance), $w_{ij} = (p_{i\cdot}p_{\cdot j})^{-1}$ (Pearson's chi-squared), and $w_{ij} = p_{\cdot j}/p_{i\cdot} + p_{i\cdot}/p_{\cdot j}$ (modified mean variance index). Figure 3 shows the empirical statistical power of the three measures under Settings 1 and 2, where it can be seen that the unweighted $D_2$ compares favorably to the weighted ones.
Figure 3.
Empirical statistical power of three weighted $D_2$ measures, with $w_{ij} = 1$ (distance covariance), $w_{ij} = (p_{i\cdot}p_{\cdot j})^{-1}$ (chi-squared), and $w_{ij} = p_{\cdot j}/p_{i\cdot} + p_{i\cdot}/p_{\cdot j}$ (symmetric mean variance index), under Settings 1 and 2. In each setting, the sample sizes range from 25 to 500, and all results are based on 1000 replications.
Based on the simulation studies, I recommend the unweighted measures with a moderate choice of r (for instance, $r = 2$) for large sparse tables, because they give satisfactory and stable statistical power in general scenarios. The maximum norm is not recommended, unless one is very confident that a very small number of cells accounts for most of the deviation from independence.
4. Discussion
In this work, I proposed a rich class of dependence measures for categorical variables based on the weighted Minkowski distance. The defined class unifies a number of existing measures, including Cramér's V, distance covariance, total variation distance and a slightly modified mean variance index. I provided the scaled forms of the unweighted measures, which range from 0 (independence) to 1 (perfect association). Further, I established the strong consistency of the defined measures and suggested a simple permutation test for evaluating significance. Although I have used nominal and univariate categorical variables for illustration, the proposed framework can be extended to other data types and problems:
First, the proposed measures can be used to detect ordinal association by assigning proper weights. Similar to Pearson's correlation coefficient, one may assign larger weights to more extreme categories of X and Y. To be specific, let $d_X(x, x')$ be a predefined distance between categories $x$ and $x'$ (e.g., the gap in category ranks), and $d_Y(y, y')$ be the distance between $y$ and $y'$; then one could apply the following weight function
$$w_{ij} = \left(\sum_{i'=1}^{I} d_X(x_i, x_{i'})\,p_{i'\cdot}\right)\left(\sum_{j'=1}^{J} d_Y(y_j, y_{j'})\,p_{\cdot j'}\right),$$
which assigns larger weights to cells in the corners but smaller weights to cells in the center of the table.
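One possible implementation of a weight of this kind (an assumed form for illustration: each cell is weighted by the product of the expected distances of its two categories from randomly drawn categories, which is largest in the corners for rank-based distances):

```python
import numpy as np

def ordinal_weight(prow, pcol, dX, dY):
    """dX[i, i'] and dY[j, j'] are predefined category distances, e.g., rank gaps."""
    wx = dX @ prow.ravel()        # expected distance of x_i from a random X
    wy = dY @ pcol.ravel()        # expected distance of y_j from a random Y
    return np.outer(wx, wy)

I = J = 5
dX = np.abs(np.subtract.outer(np.arange(I), np.arange(I))).astype(float)
dY = np.abs(np.subtract.outer(np.arange(J), np.arange(J))).astype(float)
# usage with the D_rw sketch: D_rw(p, 2, weight=lambda pr, pc: ordinal_weight(pr, pc, dX, dY))
```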
Second, my framework can be generalized to random vectors and multi-way tables. In the case of a three-way table for $(X, Y, Z)$, with joint probabilities $p_{ijk} = P(X = x_i, Y = y_j, Z = z_k)$, one can define the following Minkowski distance between $p_{ijk}$ and $p_{ij\cdot}\,p_{\cdot\cdot k}$:
$$D_r^w\big((X,Y), Z\big) = \left(\sum_{i,j,k} w_{ijk}\,\big|p_{ijk} - p_{ij\cdot}\,p_{\cdot\cdot k}\big|^r\right)^{1/r},$$
which can be used to test the joint independence between $(X,Y)$ and Z, or equivalently, to test the homogeneity of the joint distribution of $(X,Y)$ at different levels of Z. A similar permutation procedure can be applied to evaluate the statistical significance. One can also define the distance between $p_{ijk}$ and $p_{i\cdot\cdot}\,p_{\cdot j\cdot}\,p_{\cdot\cdot k}$ to test the mutual independence of X, Y and Z:
$$\left(\sum_{i,j,k} w_{ijk}\,\big|p_{ijk} - p_{i\cdot\cdot}\,p_{\cdot j\cdot}\,p_{\cdot\cdot k}\big|^r\right)^{1/r}.$$
Furthermore, the framework can be extended to the conditional independence test in three-way tables [14], by defining a distance between the conditional joint probabilities and the product of the conditional marginal probabilities:
$$\left(\sum_{i,j,k} w_{ijk}\,\big|p_{ij|k} - p_{i\cdot|k}\,p_{\cdot j|k}\big|^r\right)^{1/r}, \qquad p_{ij|k} = \frac{p_{ijk}}{p_{\cdot\cdot k}},\quad p_{i\cdot|k} = \frac{p_{i\cdot k}}{p_{\cdot\cdot k}},\quad p_{\cdot j|k} = \frac{p_{\cdot jk}}{p_{\cdot\cdot k}}.$$
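A sketch of the first two three-way measures (notation assumed: `p` is an I × J × K joint probability array; the unweighted case is shown for brevity):

```python
import numpy as np

def D_joint(p, r=2.0):
    """Joint independence of (X, Y) and Z: compare p_ijk with p_ij. * p_..k."""
    pxy = p.sum(axis=2, keepdims=True)
    pz = p.sum(axis=(0, 1), keepdims=True)
    return (np.abs(p - pxy * pz) ** r).sum() ** (1.0 / r)

def D_mutual(p, r=2.0):
    """Mutual independence of X, Y, Z: compare p_ijk with p_i.. * p_.j. * p_..k."""
    px = p.sum(axis=(1, 2), keepdims=True)
    py = p.sum(axis=(0, 2), keepdims=True)
    pz = p.sum(axis=(0, 1), keepdims=True)
    return (np.abs(p - px * py * pz) ** r).sum() ** (1.0 / r)
```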
Author Contributions
Q.Z. conceived of the presented idea, developed the theory, performed the computations and wrote the manuscript.
Funding
This research received no external funding.
Acknowledgments
The author would like to thank the editor and two reviewers for their thoughtful comments and efforts towards improving the manuscript.
Conflicts of Interest
The author declares no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MIC | maximal information coefficient |
| HSIC | Hilbert-Schmidt independence criterion |
| MV | mean variance index |
| c.d.f. | cumulative distribution function |
| p.m.f. | probability mass function |
Appendix A. Technical Details
Appendix A.1. Equivalency between Two Definitions of Distance Covariance
- Definition by Székely et al. (2007), using three independent copies $(X,Y)$, $(X',Y')$ and $(X'',Y'')$:
$$\mathrm{dCov}^2(X,Y) = E\big[d(X,X')\,d(Y,Y')\big] + E\big[d(X,X')\big]E\big[d(Y,Y')\big] - 2E\big[d(X,X')\,d(Y,Y'')\big]$$
- Definition by Sejdinovic et al. (2013), using only two independent copies $(X,Y)$ and $(X',Y')$:
$$\mathrm{dCov}^2(X,Y) = E\big[d(X,X')\,d(Y,Y')\big] + E\big[d(X,X')\big]E\big[d(Y,Y')\big] - 2E\Big\{E\big[d(X,X') \mid X\big]\,E\big[d(Y,Y') \mid Y\big]\Big\}$$
The first two terms are the same, and the equivalency between the two definitions can be shown as follows:
$$E\big[d(X,X')\,d(Y,Y'')\big] = E\Big\{E\big[d(X,X')\,d(Y,Y'') \mid X, Y\big]\Big\} = E\Big\{E\big[d(X,X') \mid X\big]\,E\big[d(Y,Y'') \mid Y\big]\Big\} = E\Big\{E\big[d(X,X') \mid X\big]\,E\big[d(Y,Y') \mid Y\big]\Big\},$$
where the second equality holds because $d(X,X')$ and $d(Y,Y'')$ are conditionally independent given $(X,Y)$, and the third equality holds because $Y'$ and $Y''$ are identically distributed.
Appendix A.2. Derivation of Equation (3)
Following Zhang (2019), I rewrite the categorical variables X and Y as two random indicator vectors of dimensions I and J, $\tilde{X} = \big(\mathbb{1}(X = x_1), \dots, \mathbb{1}(X = x_I)\big)^T$ and $\tilde{Y} = \big(\mathbb{1}(Y = y_1), \dots, \mathbb{1}(Y = y_J)\big)^T$, where $\mathbb{1}(\cdot)$ stands for the indicator function. Let $d(X, X')$ equal 0 if $X = X'$ and 1 otherwise. Let $(X,Y)$, $(X',Y')$, $(X'',Y'')$ be three independent copies of $(X,Y)$. By Equation (2), the squared distance covariance can be also expressed as
$$\mathrm{dCov}^2(X,Y) = P(X \neq X', Y \neq Y') + P(X \neq X')\,P(Y \neq Y') - 2E\big[P(X \neq X' \mid X)\,P(Y \neq Y'' \mid Y)\big].$$
Under the multinomial sampling scheme, it is straightforward to show that
$$P(X \neq X', Y \neq Y') = 1 - \sum_{i} p_{i\cdot}^2 - \sum_{j} p_{\cdot j}^2 + \sum_{i,j} p_{ij}^2,$$
$$P(X \neq X') = 1 - \sum_{i} p_{i\cdot}^2, \qquad P(Y \neq Y') = 1 - \sum_{j} p_{\cdot j}^2,$$
$$E\big[P(X \neq X' \mid X)\,P(Y \neq Y'' \mid Y)\big] = \sum_{i,j} p_{ij}\,(1 - p_{i\cdot})(1 - p_{\cdot j}) = 1 - \sum_{i} p_{i\cdot}^2 - \sum_{j} p_{\cdot j}^2 + \sum_{i,j} p_{ij}\,p_{i\cdot}\,p_{\cdot j}.$$
Summarizing the results above, I have
$$\mathrm{dCov}^2(X,Y) = \sum_{i,j} p_{ij}^2 - 2\sum_{i,j} p_{ij}\,p_{i\cdot}\,p_{\cdot j} + \Big(\sum_{i} p_{i\cdot}^2\Big)\Big(\sum_{j} p_{\cdot j}^2\Big) = \sum_{i,j}\big(p_{ij} - p_{i\cdot}\,p_{\cdot j}\big)^2.$$
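The identity can also be checked by simulation (an illustration with an arbitrary 2 × 2 distribution of my own choosing): drawing three independent copies of $(X,Y)$ and evaluating the three-copy definition with the 0/1 metric reproduces $\sum_{i,j}(p_{ij} - p_{i\cdot}p_{\cdot j})^2$ up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([[0.3, 0.1], [0.2, 0.4]])            # arbitrary joint distribution
I, J = p.shape
prow, pcol = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
closed = ((p - prow * pcol) ** 2).sum()           # Equation (3)

n = 400_000
idx = rng.choice(I * J, size=(3, n), p=p.ravel()) # three independent copies of (X, Y)
x, y = idx // J, idx % J
dxx = (x[0] != x[1]).astype(float)                # d(X, X')
dyy = (y[0] != y[1]).astype(float)                # d(Y, Y')
dyy2 = (y[0] != y[2]).astype(float)               # d(Y, Y'')
mc = (dxx * dyy).mean() + dxx.mean() * dyy.mean() - 2 * (dxx * dyy2).mean()
print(closed, mc)                                 # agree up to Monte Carlo error
```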
Appendix A.3. Derivation of the Modified Mean Variance Index
The symmetric mean variance index is defined as
$$\mathrm{MVS}(X,Y) = \mathrm{MV}(Y|X) + \mathrm{MV}(X|Y) = E_Y\big[\mathrm{Var}_X\big(p(Y|X)\big)\big] + E_X\big[\mathrm{Var}_Y\big(p(X|Y)\big)\big].$$
I first derive the explicit formula for $\mathrm{MV}(Y|X)$. Using $E_X\big[p(y_j|X)\big] = \sum_{i} p_{i\cdot}\,(p_{ij}/p_{i\cdot}) = p_{\cdot j}$, I have
$$\mathrm{MV}(Y|X) = \sum_{j=1}^{J} p_{\cdot j}\sum_{i=1}^{I} p_{i\cdot}\left(\frac{p_{ij}}{p_{i\cdot}} - p_{\cdot j}\right)^2 = \sum_{i,j}\frac{p_{\cdot j}}{p_{i\cdot}}\big(p_{ij} - p_{i\cdot}\,p_{\cdot j}\big)^2.$$
Similarly, it can be seen that $\mathrm{MV}(X|Y) = \sum_{i,j}\frac{p_{i\cdot}}{p_{\cdot j}}\big(p_{ij} - p_{i\cdot}\,p_{\cdot j}\big)^2$, therefore
$$\mathrm{MVS}(X,Y) = \sum_{i,j}\left(\frac{p_{\cdot j}}{p_{i\cdot}} + \frac{p_{i\cdot}}{p_{\cdot j}}\right)\big(p_{ij} - p_{i\cdot}\,p_{\cdot j}\big)^2 = \big[D_2^w(X,Y)\big]^2, \qquad w_{ij} = \frac{p_{\cdot j}}{p_{i\cdot}} + \frac{p_{i\cdot}}{p_{\cdot j}}.$$
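A numerical check of this derivation (with an arbitrary joint distribution of my own choosing): the direct conditional computation matches the closed-form weight.

```python
import numpy as np

p = np.array([[0.25, 0.05, 0.10],
              [0.05, 0.30, 0.25]])
prow, pcol = p.sum(1, keepdims=True), p.sum(0, keepdims=True)

cond_yx = p / prow     # p(y_j | x_i)
cond_xy = p / pcol     # p(x_i | y_j)
mv_yx = (pcol * (prow * (cond_yx - pcol) ** 2).sum(0, keepdims=True)).sum()
mv_xy = (prow * (pcol * (cond_xy - prow) ** 2).sum(1, keepdims=True)).sum()

closed = ((pcol / prow + prow / pcol) * (p - prow * pcol) ** 2).sum()
print(mv_yx + mv_xy, closed)   # the two values agree
```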
Appendix A.4. Proof of Theorem 1
Because $\hat{D}_r^w \leq C^{1/r}\hat{D}_r \leq C\,\hat{D}_1$ for any $r \geq 1$ and $C \geq 1$ (and likewise for the maximum norm), I only need to prove the strong consistency for $\hat{D}_1$. For a categorical variable X, let $p(\cdot)$ be the probability mass function, N be the sample size, and $\hat{p}(\cdot)$ be the sample estimate; Biau and Györfi (2005) [15] proved the following result:
Lemma A1.
For any $\epsilon > 0$, $P\Big(\sum_{x \in \mathcal{X}}\big|\hat{p}(x) - p(x)\big| > \epsilon\Big) \leq 2^{|\mathcal{X}|}\,e^{-N\epsilon^2/2}$.
As $\hat{p}_{ij} - \hat{p}_{i\cdot}\hat{p}_{\cdot j} = (\hat{p}_{ij} - p_{ij}) + (p_{ij} - \hat{p}_{i\cdot}\,p_{\cdot j}) + (\hat{p}_{i\cdot}\,p_{\cdot j} - \hat{p}_{i\cdot}\,\hat{p}_{\cdot j})$, I have
$$\hat{D}_1(X,Y) \leq \sum_{i,j}\big|\hat{p}_{ij} - p_{ij}\big| + \sum_{i,j}\big|p_{ij} - \hat{p}_{i\cdot}\,p_{\cdot j}\big| + \sum_{i,j}\big|\hat{p}_{i\cdot}\,p_{\cdot j} - \hat{p}_{i\cdot}\,\hat{p}_{\cdot j}\big|.$$
Under independence, I have $p_{ij} = p_{i\cdot}\,p_{\cdot j}$, so the second term equals $\sum_{i,j} p_{\cdot j}\big|p_{i\cdot} - \hat{p}_{i\cdot}\big| = \sum_{i}\big|\hat{p}_{i\cdot} - p_{i\cdot}\big|$. By Lemma A1, the first term satisfies that
$$P\Big(\sum_{i,j}\big|\hat{p}_{ij} - p_{ij}\big| > \epsilon/3\Big) \leq 2^{IJ}\,e^{-N\epsilon^2/18}.$$
The third term can be bounded as follows:
$$\sum_{i,j}\big|\hat{p}_{i\cdot}\,p_{\cdot j} - \hat{p}_{i\cdot}\,\hat{p}_{\cdot j}\big| = \sum_{i}\hat{p}_{i\cdot}\sum_{j}\big|p_{\cdot j} - \hat{p}_{\cdot j}\big| = \sum_{j}\big|\hat{p}_{\cdot j} - p_{\cdot j}\big|.$$
By Lemma A1, I have
$$P\Big(\sum_{i}\big|\hat{p}_{i\cdot} - p_{i\cdot}\big| > \epsilon/3\Big) \leq 2^{I}\,e^{-N\epsilon^2/18}, \qquad P\Big(\sum_{j}\big|\hat{p}_{\cdot j} - p_{\cdot j}\big| > \epsilon/3\Big) \leq 2^{J}\,e^{-N\epsilon^2/18},$$
and summarizing the results above, I have
$$P\big(\hat{D}_1(X,Y) > \epsilon\big) \leq \big(2^{IJ} + 2^{I} + 2^{J}\big)\,e^{-N\epsilon^2/18}.$$
Combining this with $\hat{D}_r^w \leq C\,\hat{D}_1$ yields the bound in Theorem 1, and the almost sure convergence follows from the Borel–Cantelli lemma.
References
- Zhang, Q. Independence test for large sparse contingency tables based on distance correlation. Stat. Probab. Lett. 2019, 148, 17–22. [Google Scholar] [CrossRef]
- Székely, G.; Rizzo, M.; Bakirov, N. Measuring and testing dependence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
- Goodman, L.; Kruskal, W. Measures of association for cross classifications, part I. J. Am. Stat. Assoc. 1954, 49, 732–764. [Google Scholar]
- Cui, H.; Li, R.; Zhong, W. Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis. J. Am. Stat. Assoc. 2015, 110, 630–641. [Google Scholar] [CrossRef] [PubMed]
- Theil, H. On the estimation of relationships involving qualitative variables. Am. J. Sociol. 1970, 76, 103–154. [Google Scholar] [CrossRef]
- McCane, B.; Albert, M. Distance functions for categorical and mixed variables. Pattern Recognit. Lett. 2008, 29, 986–993. [Google Scholar] [CrossRef]
- Reshef, D.; Reshef, Y.; Finucane, H.; Grossman, S.; McVean, G.; Turnbaugh, P.; Lander, E.; Mitzenmacher, M.; Sabeti, P. Detecting novel associations in large data sets. Science 2011, 334, 1518–1524. [Google Scholar] [CrossRef]
- Moews, B.; Herrmann, M.; Ibikunle, G. Lagged correlation-based deep learning for directional trend change prediction in financial time series. arXiv 2018, arXiv:1811.11287. [Google Scholar] [CrossRef]
- Knuth, D. The Art of Computer Programming, 3rd ed.; Addison-Wesley: Boston, MA, USA, 1997. [Google Scholar]
- Cramér, H. Mathematical Methods of Statistics; Princeton University Press: Princeton, NJ, USA, 1946. [Google Scholar]
- Tschuprow, A. Principles of the Mathematical Theory of Correlation. Bull. Am. Math. Soc. 1939, 46, 389. [Google Scholar]
- Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2013, 41, 2263–2291. [Google Scholar] [CrossRef]
- Sriperumbudur, B.; Fukumizu, K.; Gretton, A.; Scholkopf, B.; Lanckriet, G. On the empirical estimation of integral probability metrics. Electron. J. Stat. 2012, 6, 1550–1599. [Google Scholar] [CrossRef]
- Zhang, Q.; Tinker, J. Testing conditional independence and homogeneity in large sparse three-way tables using conditional distance covariance. Stat 2019, 8, 1–9. [Google Scholar] [CrossRef]
- Biau, G.; Györfi, L. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Trans. Inf. Theory 2005, 51, 3965–3973. [Google Scholar] [CrossRef]
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).