Learned Practical Guidelines for Evaluating Conditional Entropy and Mutual Information in Discovering Major Factors of Response-vs.-Covariate Dynamics

We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. Then we resolve these topics' data analysis tasks by discovering major factors underlying such Re-Co dynamics, making use only of the data's categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information (I[Re; Co]) as the two key Information Theoretical measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re; Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimations of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly carry out six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.


Introduction
The majority of scientific fields, such as biology [1], neuroscience [2], medicine, sociology and psychology [3] and many others [4], involve dynamics of complex systems [5,6]. Scientists and experts in such fields typically can only imagine or, at best, briefly outline various potential response-vs.-covariate (Re-Co) relationships in an attempt to characterize the dynamics of their complex systems of interest [7]. Even though no explicit functional form of such Re-Co relationships is available, such scientists still go ahead and collect structured data sets by investing great effort in choosing which features to assign the role of response variable and which features to assign the role of covariate variables. Such choices of features are indeed critical for the sciences, because their success relies entirely on whether such structured data sets can embrace the essence of the targeted Re-Co dynamics or not.
After scientists achieve their scientific quests by generating structured data sets from the complex systems of interest, it becomes not only very natural but also very important to ask the following specific question: When such structured data sets are in the data analysts' hands, what is the most essential common goal of data analysis? This goal is certainly not aimed at an explicit system of equations, nor at a complete set of functional descriptions of the targeted Re-Co dynamics. Instead, this goal can and shall be oriented toward decoding the scientists' authentic knowledge and intelligence about the complex systems of interest, and, one step further, toward going beyond the current state of understanding.
In sharp contrast, nearly all statistical model-based data analyses on structured data sets pertaining to a wide range of Re-Co dynamics assume an explicit functional structure linking the response variables to covariate variables, as in hypothesis testing [8], analysis of variance (ANOVA) and the many variants of regression analysis [9,10], including generalized linear models and log-linear models [11,12]. By framing rather complex Re-Co dynamics with rather simplistic explicit functional structures, statistical model-based data analysis surely runs the danger of hijacking the data's authentic information content. With such dangers in mind, it is natural to ask the reverse question: What if we could reformulate all fundamental statistical tasks to fit under a framework of response-vs.-covariate (Re-Co) dynamics without explicit functional forms, and extract the data's authentic information content?
As the theme of this paper, we demonstrate a positive answer to the above fundamental question. The chief merits of such demonstrations are that we not only can do nearly all data analysis without statistical modeling, but, more importantly, we can reveal the data's authentic information content to foster true understanding of the complex systems of interest. Our computational developments are illustrated through a series of six well-known statistical topics of increasing complexity. All successfully revealed information content is visible and interpretable.
The positive answer resides in the paradigm called Categorical Exploratory Data Analysis (CEDA), with its heart anchored at a major factor selection protocol, which has been under development in a series of published works [13][14][15][16] and a recently completed work [17]. To demonstrate the positive answer, this paper establishes practical guidelines for evaluating Theoretical Information Measurements, in particular Shannon's conditional entropy (CE) and the mutual information between the response variables and covariate variables, denoted as I[Re; Co] [18], which are the basis of CEDA and its major factor selection protocol.
Along the process of establishing such computational guidelines, we characterize four theme-components in CEDA and the major factor selection protocol: TC-1. Our practical guidelines are established here for evaluating CE and I[Re; Co] without requiring consistent estimations of their theoretical population-version measurements. TC-2. All entropy-related evaluations are carried out on a contingency table platform, so the learned practical guidelines also provide ways of relieving the effects of the curse of dimensionality and ascertaining the [C1:confirmable] criterion, which is a kind of relative reliability. TC-3. CEDA is free of man-made assumptions and structures; consequently, its inferences are carried out with natural reliability. TC-4. CEDA only employs the data's categorical nature, so the confirmed collection of major factors indeed reveals the data's authentic information content regardless of data types.
The theme-component [TC-1] allows us to avoid many technical and difficult issues encountered in estimating theoretical information measurements [19,20]. [TC-1] and [TC-2] together make CEDA's major factor selection protocol very distinct from model-based feature selection based on mutual information evaluations [21][22][23][24], while [TC-3] makes CEDA's inferences realistic, and [TC-4] lets CEDA provide authentic information content with very wide applicability.
For specifically illustrating these four theme-components, we consider a structured data set consisting of data points that are measured and collected in an (L + K)-dimensional vector format with respect to L + K features. The first L components are the designated response (Re) features' measurements or categories, denoted as Y = (Y_1, . . . , Y_L), and the remaining K components are the K covariate (Co) features' measurements or categories, denoted as {V_1, . . . , V_K}. It is essential to note that some or even all covariate features could be categorical. Thus, the data analysts' task is prescribed as precisely extracting the authentic associative relations between Y and {V_1, . . . , V_K} based on a structured data set.
For extracting authentic associations between response and covariate features, various Theoretical Information Measurements are employed under the structured data setting in [13][14][15][16][17]. In particular, the Re-Co directional associations developed in CEDA and its major factor selection protocol rely on evaluations of Shannon's conditional entropy (CE) and mutual information (I[Re; Co]) that are all carried out upon the contingency table platform. This platform is indeed very flexible and adaptable to the number of features on the row- and column-axes as well as to the total number of data points. Such a key characteristic makes CEDA very versatile in applicability. We explain in more detail as follows.
On the response side, a collection of categories of response features (pertaining to Y) is determined with respect to their categorical nature and sample size. Likewise, on the covariate side, a collection of categories for each 1D covariate feature (pertaining to V_k for k = 1, . . . , K) is chosen accordingly. It is noted that a continuous feature is categorized with respect to its histogram [25]. If L > 1, then the entire collection of response categories consists of all non-empty cells or hypercubes of LD contingency tables. However, when L is large, the total number of LD hypercubes could be too large for a finite data set, in the sense that many hypercubes are occupied by very few data points. This is known as the effect of the curse of dimensionality. To avoid such an effect, clustering algorithms, such as Hierarchical clustering or K-means, can be performed to fuse the L response features (upon their original continuous measurement scales, or their contingency tables when categorical ones are involved) into one single categorical response variable. The number of categories can be pre-determined for the K-means algorithm, or determined by cutting a Hierarchical clustering tree in a fashion such that there is only one tree branch per category. The essential idea behind such feature-fusing operations is to retain the structural dependency among these L response features, while at the same time reducing the detrimental effect of the curse of dimensionality.
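The feature-fusing operation just described can be sketched as follows, assuming continuous response features; a small Lloyd's-iteration K-means keeps the sketch self-contained (the cluster count k = 8 and the simulated 3D correlated responses are purely illustrative):

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    """Lloyd's K-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each center (skip clusters that emptied)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Fuse L = 3 correlated response features into one categorical response
rng = np.random.default_rng(1)
cov = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]]
Y = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=500)
Y_cat = kmeans_labels(Y, k=8)   # one category label per 3D response point
```

Because the clusters are formed in the full LD space, the fused categories retain the dependency structure among the L response features, which is exactly the point of the operation.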
In contrast, on the covariate side, singleton and joint (or interacting) effects of all possible subsets of {V_1, . . . , V_K} are theoretically possible. However, it is practically known that the highest order of interacting effects worth considering is to a great extent determined by the sample size. That is, a covariate-vs.-response contingency table platform can vary greatly in dimensions: large or small. When a contingency table is viewed as a high-dimensional histogram, which is a naive form of density estimation, the curse of dimensionality, or so-called finite sample phenomenon, is expected to affect our conditional entropy evaluations whenever the table's dimension is large relative to the data's sample size. We use the notation C[A-vs-Y] (rows-vs-columns) for a contingency table of a covariate variable subset A ⊆ {V_1, . . . , V_K} and the response variable Y. As a convention, the categories of Y are arranged along the column-axis, while the categories of A are arranged along the row-axis. This row-axis expands with respect to the membership of A.
In CEDA, the associative patterns between any A ⊆ {V_1, . . . , V_K} and Y are discovered and evaluated using the contingency table C[A-vs-Y]. It is necessary to reiterate that C[A-vs-Y] can be viewed as a "joint histogram" or "density estimation" of all features contained in A and Y. From this perspective, as the dimension of C[A-vs-Y] expands with A including more variables, it is expected that its dimensionality will affect the comparability and reliability of conditional entropy evaluations. Consequently, for comparability purposes, the criterion [C1:confirmable] arises in CEDA. This criterion is based on a so-called data mimicking operation developed in [14], as described in the following paragraphs.
Let Ã denote one mimicry of A, in the ideal sense of having the same deterministic and stochastic structures. In other words, Ã is generated to have the same empirical categorical distribution as A; see [14] for construction details. More practically speaking, if the empirical categorical distribution of A is represented by a contingency table, then, given the observed vector of row-sums, Ã would be another contingency table that has the same lattice dimensions and all of whose row-vectors are generated from Multinomial distributions with parameters specified by the corresponding row-sum and the corresponding vector of observed proportions in A's contingency table. It is noted that Ã is constructed independently of Y; that is, Ã is stochastically independent of Y [14].
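Under our reading of [14], this mimicking operation can be sketched as follows (the example table A is hypothetical; each row of the mimicry is one Multinomial draw with the observed row-sum and observed row proportions):

```python
import numpy as np

def mimic_table(table, seed=0):
    """Generate one mimicry of a contingency table: each row is redrawn
    from Multinomial(row-sum, observed row proportions)."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table, dtype=float)
    mimic = np.empty_like(table)
    for i, row in enumerate(table):
        n = int(row.sum())          # observed row-sum is kept fixed
        p = row / row.sum()         # observed row proportions
        mimic[i] = rng.multinomial(n, p)
    return mimic

A = np.array([[30, 10, 10], [5, 20, 25]])
A_tilde = mimic_table(A)
# the lattice dimensions and the row-sums are preserved exactly
```

Since the draw uses only A's own marginal information and never touches Y, an ensemble of such mimicries yields a null reference for entropy evaluations.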
Denote the mutual information of Y and A by I[Y; A], and that of Y and the mimicry Ã by I[Y; Ã]. Since Ã is stochastically independent of Y, an ensemble of I[Y; Ã] evaluations provides a null reference distribution against which the observed I[Y; A] is compared (for estimation issues of such information measurements, see the asymptotic analysis in [19] and computational protocols based on the biGamma function in [20]).
Here, we do not take the view of the contingency table as a setup of Grenander's Method of Sieves (MoS) [26]. Though MoS can be a choice for practical reasons and for computing issues involving many-dimensional features or variables, we are not primarily concerned with estimating the population versions of CEs and I[Re; Co] per se, nor with the induced sieve biases. Rather, the dimensions of contingency tables are made adaptable to the necessity of accommodating multiple covariate feature-members in A. In such cases, the collection of categories of A might be built via hierarchical or K-means clustering algorithms. From this perspective, computations of the theoretical conditional entropy and mutual information between a multi-dimensional covariate and a possibly multi-dimensional Y are neither realistically nor practically possible, due to the limited sizes of available data sets. Since this kind of sieve is data dependent, the computations of sieve biases can be much more complicated than those covered in [19].
In this paper, we illustrate and carry out CEDA coupled with its major factor selection protocol through a series of six classic statistical topic examples, within each of which various scenarios are also considered. By building contingency tables across various dimensions with respect to different sample sizes, we attempt to reveal the robustness of CEDA resolutions to statistical topic issues. On the one hand, we learn practical guidelines for evaluating conditional (Shannon) entropy and mutual information along this illustrative process. On the other hand, we demonstrate that very distinct CEDA resolutions to these classic statistical topics can be achieved by coherently extracting the data's authentic information content, which is the intrinsic goal of any proper data analysis. That being said, if modeling is indeed a necessary step within a scientific quest, then the data's authentic information content surely will better serve its purpose by supplying confirmed structures with which to begin a new kind of data-driven modeling.
At the end of this section, we briefly project the applicability of our CEDA approach for data analysis related to complex systems. One critical application is in case-control studies, since such studies typically involve multiple features of any data type, as often conducted in medical, pharmaceutical, and epidemiological research. Another critical application of CEDA is to serve as an alternative approach to all kinds of regression analysis techniques based on linear, logistic, log-linear, or generalized linear regression models. Such modeling-based analyses are often required and conducted in the biological, social, and economic sciences, among many other scientific fields. Furthermore, in our ongoing research, we look into the issue of how well CEDA can deal with causality issues. Additionally, with such a wide spectrum of applicability, we project that CEDA will become an essential topic of data analysis education in the fields of statistics, physics, and beyond in the foreseeable future.

Estimations of Mutual Information between One Categorical and One Quantitative Variable
In this section, we demonstrate how to resolve classic statistical tasks by discovering major factors based on entropy evaluations. First, we frame each classic statistical task into precisely stated Re-Co dynamics. Secondly, we compute and discover major factors underlying this Re-Co dynamics. Inferences are then performed under [C1:confirmable] criterion across a spectrum of contingency tables with varying designed dimensions. Thirdly, we look beyond the setting of the discussed examples to much wider related statistical topics.
Throughout this paper, all 95% confidence ranges (CR) are calculated as the region between the 2.5% percentile on the lower tail and the 97.5% percentile on the upper tail of the relevant simulated distribution. A CR reflecting both tail behaviors is considered informative; moreover, even when the upper tail is the only quantity of interest, as is the case in this paper, the classic one-sided 97.5% confidence limit remains visible.
2.1. [Example-1]: From 1D Two-Sample Problem to One-Way and Two-Way ANOVA

Consider a data set consisting of quantitative observations {Y_lj | l = 1, 2; j = 1, . . . , N_l} of a 1D response feature Y derived from two populations labeled by l = 1, 2, respectively. Let Y_lj be distributed according to F_l(.). Testing the distributional equality hypothesis H_0: F_1(y) = F_2(y), for all y ∈ R^1, is the most fundamental topic in statistics. Under this setting, the only covariate V_1 is the categorical population-ID taking values in {1, 2}. This hypothesis testing problem and its subsequent ones can be turned into an equivalent problem: Is V_1 a major factor underlying the Re-Co dynamics of Y? If V_1 is not a major factor, then H_0 is accepted. If H_0 is indeed rejected by confirming V_1 as a major factor, then we would further want to discover where the two distributions differ.
For illustrative simplicity, let Y_1j ∼ N(0, 1) and Y_2j ∼ N(1, 1) with j = 1, . . . , N/2, that is, N_1 = N_2. From a theoretical information measurement perspective, the entropy of Y is calculated to be H[Y] = 1.5321, and its conditional entropy is H[Y|V_1] = 1.4189, so that the mutual information is I[Y; V_1] = H[Y] − H[Y|V_1] = 0.1132. By V_1 being a major factor of Y, we mean that V_1 is not replaceable by any covariate variable that is stochastically independent of Y, such as a fair-coin-tossing random variable ε. That is, we theoretically establish this fact by knowing I[Y; V_1] = 0.1132 > 0 = I[Y; ε].

In the real world, the two population-specific distributions F_1(.) and F_2(.) are often unknown. To accommodate this realistic setting, we build a histogram, say F̂(.), based on the pooled observed data set {Y_lj | l = 1, 2; j = 1, . . . , N_l}. With a chosen version of F̂(.) with K bins, we can build a 2 × K contingency table, denoted by C[V_1-vs-Y]. Its two rows correspond to the two population-IDs, and the K bins, with column-sums n_k, k = 1, . . . , K, are arranged along the column-axis. That is, C[V_1-vs-Y] keeps the records of population-IDs for all members within each bin of F̂(.), and enables us to estimate the mutual information as I[Y; V_1] = Ĥ[Y] − Ĥ[Y|V_1]. All estimates of I[Y; V_1] are then compared with estimates of I[Y; ε] from 2 × K contingency tables generated as follows: the kth column, k = 1, . . . , K, is simulated from a Binomial random variable BN(n_k, P_0) with P_0 = (N_1/N, N_2/N). This comparison of I[Y; V_1] with I[Y; ε] is a way of testing whether a major factor candidate satisfies the criterion [C1: confirmable] of [15]. Precisely, this testing is performed by comparing the observed estimate of I[Y; V_1] against the simulated distribution of I[Y; ε].
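This comparison can be sketched as follows (a hypothetical implementation; the quantile-based bin edges, K = 10, and the ensemble size of 500 are illustrative choices):

```python
import numpy as np

def table_mutual_info(table):
    """I[Y; V1] = H[Y] - H[Y|V1] from a 2 x K contingency table
    (rows: population-IDs; columns: histogram bins of Y)."""
    table = np.asarray(table, dtype=float)
    N = table.sum()
    p_col = table.sum(axis=0) / N        # marginal bin probabilities of Y
    H_Y = -np.sum(p_col[p_col > 0] * np.log(p_col[p_col > 0]))
    H_cond = 0.0
    for row in table:                    # one row per population-ID
        p = row[row > 0] / row.sum()
        H_cond += (row.sum() / N) * (-np.sum(p * np.log(p)))
    return H_Y - H_cond

rng = np.random.default_rng(0)
N = 2000
Y1 = rng.normal(0.0, 1.0, N // 2)        # population 1: N(0, 1)
Y2 = rng.normal(1.0, 1.0, N // 2)        # population 2: N(1, 1)
pooled = np.concatenate([Y1, Y2])
edges = np.quantile(pooled, np.linspace(0, 1, 11))   # K = 10 equal-count bins
c1 = np.histogram(Y1, bins=edges)[0]
c2 = np.histogram(Y2, bins=edges)[0]
obs = table_mutual_info(np.vstack([c1, c2]))

# null reference: split each bin total n_k via Binomial(n_k, N1/N = 1/2)
null = []
for _ in range(500):
    k1 = rng.binomial(c1 + c2, 0.5)
    null.append(table_mutual_info(np.vstack([k1, (c1 + c2) - k1])))
upper = np.quantile(null, 0.975)
# obs landing above `upper` confirms V_1 under [C1: confirmable]
```

The observed estimate lands far above the 95% null range here, mirroring the pattern reported for Table 1.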
To make our focal issue concrete and meaningful, we undertake the following simulation study, in which the reliability issue of H[Y|V_1] estimation is addressed and, at the same time, [C1: confirmable] is tested. Recall that Y_1j ∼ N(0, 1) and Y_2j ∼ N(1, 1) with j = 1, . . . , N/2. We consider two cases: N = 2000 and N = 20,000. For practical considerations with respect to the infinite range of Normality, we choose K' = K + 2 bins for building a histogram in a 1 + K + 1 fashion: the observed 90% quantile range [F̂^(-1)(0.05), F̂^(-1)(0.95)] is divided into K equal-width bins, with one extra bin for each of the two tails. We use 5 choices of K ∈ {10, 20, 30, 100, 1000}. For each K value, the Shannon entropy H^(K)[Y] and conditional entropy H^(K)[Y|V_1] are estimated, and a 95% confidence range (CR) of I[Y; ε] is also simulated and reported based on an ensemble of simulated null tables. In summary, Table 1 indicates that the estimate of the mutual information I[Y; V_1] is far above the 95% confidence range under the null hypothesis for each of the 5 choices of K under both cases of N. 9 out of the 10 cases have almost 0 p-values, the exception being the 1 + 1000 + 1 case with N = 2000. These facts point to one common observation: when all bins contain at least 20 data points, the estimate of I[Y; V_1] is reasonably stable and practically valid. That is, we only need a stable and valid estimate of I[Y; V_1] for the purpose of confirming a major factor candidacy.
In fact, it is surprising to see that, even when K = 1000 in the case of N = 2000, the estimate of I[Y; V_1] still satisfies the [C1: confirmable] criterion by going beyond the upper limit of the 95% confidence range of I[Y; ε]. This fact implies that the correct decision is still retained, because V_1 is confirmed as a major factor. These observations become crucial when estimations of I[Y; V_1] face the effects of the curse of dimensionality, also called the finite sample phenomenon.
With V_1 determined as a major factor underlying the dynamics of Y and the hypothesis H_0 rejected, we can then check which of the K + 2 bins' observed entropies fall inside or outside of bin-specific entropy-confidence-ranges built from counts simulated via BN(n_k, P_0) across k = 1, . . . , K + 2. By doing so, we discover where F_1(.) and F_2(.) differ locally.
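A simplified sketch of this bin-wise localization (hypothetical counts; as a count-based stand-in for the bin-specific entropy comparison, we check each bin's observed population-1 count against a simulated Binomial range):

```python
import numpy as np

def localize_differences(c1, c2, p1, n_sim=2000, seed=0):
    """For each bin, check whether the observed population-1 count
    falls outside the 95% range of Binomial(n_k, p1) under H0."""
    rng = np.random.default_rng(seed)
    flags = []
    for k1, k2 in zip(c1, c2):
        n_k = k1 + k2                       # bin total
        sims = rng.binomial(n_k, p1, size=n_sim)
        lo, hi = np.quantile(sims, [0.025, 0.975])
        flags.append(not (lo <= k1 <= hi))  # True: local difference flagged
    return flags

# hypothetical 2 x 5 table with a clear discrepancy in the last bin
c1 = np.array([50, 48, 52, 50, 90])
c2 = np.array([50, 52, 48, 50, 10])
flags = localize_differences(c1, c2, p1=0.5)
```

Only the final bin is flagged, i.e., the two populations' distributions differ locally there while agreeing elsewhere.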
Next, one very interesting observation is found and reported in Table 1: the values of H^(K)[Y] grow with K in a regular pattern. We explain how this observation occurs. Let f(y) = F'(y) be the hypothetical density function of the random variable Y with observed values {Y_lj | l = 1, 2; j = 1, . . . , N/2}. Based on the fundamental theorem of calculus, for each K, the theoretical Shannon entropy H(Y) is approximated as

H(Y) = −∫ f(y) log f(y) dy ≈ −Σ_{k=1,...,K} f(y*_k) log f(y*_k) Δ(K) ≈ H^(K)[Y] + log Δ(K),

where the y*_k's denote intermediate values in the Mean Value Theorem of Calculus and Δ(K) is the common bin width. Moreover, Δ(10) = J · Δ(J × 10) with J = 2, 3, 10 and 100. Therefore, we have the approximating relations

H^(J×10)[Y] − H^(10)[Y] ≈ log J.

After the subtractions, the differences are close to log 2, log 3, log 10 and log 100, which matches the numbers shown in the 3rd column of Table 1.
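This log J relation is easy to verify numerically (a sketch; the sample size, the binning range, and the J = 10 refinement are illustrative):

```python
import numpy as np

def hist_entropy(y, k, bin_range=(-4.0, 5.0)):
    """Discrete Shannon entropy of a K-bin equal-width histogram of y."""
    counts, _ = np.histogram(y, bins=k, range=bin_range)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

# the two-sample mixture of Example-1: N(0, 1) and N(1, 1)
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 10_000), rng.normal(1, 1, 10_000)])

H10 = hist_entropy(y, 10)
H100 = hist_entropy(y, 100)
# refining bins by a factor J = 10 raises the estimate by about log 10
diff = H100 - H10
```

Since Δ(10) = 10 · Δ(100) over the same fixed range, the two estimates differ by roughly log 10 ≈ 2.3026, exactly the regular pattern described above.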
For the same reason, these relations hold for the estimated conditional entropies as well. That is, we also have H^(J×10)[Y|V_1] − H^(10)[Y|V_1] ≈ log J for all K's, when all involved bins have 30 or so data points, as seen in the 4th column of Table 1. This is why we see the estimated values of I^(K)[Y; V_1] being nearly constant (w.r.t. K) when K = 10, 20, 30 with N = 2000 and K = 10, 20, 30, 100 with N = 20,000. This is a critical fact that allows us to employ mutual information estimates with reliability. Thus, we use the notation I[Y; V_1] from here on, instead of I^(K)[Y; V_1].

Here we further remark that the two-sample hypothesis testing problem (L = 2) can be extended into the so-called multiple-sample problem (L > 2). Correspondingly, the categorical variable V_1 of population-IDs is equipped with L categories. The hypothesis testing of H_0: F_1(y) = · · · = F_L(y) retains the same equivalent formulation: Is V_1 a major factor underlying the dynamics of Y? This multiple-sample problem is also known as one-way ANOVA, one fundamental topic in Analysis of Variance. Another fundamental topic in Analysis of Variance is two-way ANOVA, which involves two categorical covariate features: V_1 and V_2. Let these two covariate features have L_1 and L_2 categories, respectively. Within the population with V_1 = l and V_2 = h, the measurements Y_lhj are distributed according to F_lh(.) with l = 1, . . . , L_1 and h = 1, . . . , L_2.
The classic two-way ANOVA setting is specified by assuming Normality, Y_lhj ∼ N(µ_lh, σ²), with µ_lh satisfying the following linear structure:

µ_lh = µ + α_l + β_h + γ_lh,

with µ as the overall effect, the α_l's as the effects of V_1, the β_h's as the effects of V_2, and the γ_lh's as the interacting effects of V_1 and V_2. These effect parameters satisfy the linear constraints:

Σ_l α_l = Σ_h β_h = Σ_l γ_lh = Σ_h γ_lh = 0.

It is evident that this classic two-way ANOVA formulation is rather limited, in the sense of excluding the possibility that Y_lhj does not have an informative mean, such as non-normal distributions with heavy tails or more than one mode, or even lacks the concept of a mean altogether, as for a categorical variable.
A much more widely extended two-way version is given as follows: Y is generated through an unknown global function consisting of the following unknown componentwise mechanisms: the unknown component mechanism M_1(V_1) having V_1 as its order-1 major factor; another unknown component mechanism M_2(V_2) having V_2 as its order-1 major factor; and the unknown interacting component mechanism M_12(V_1, V_2) with (V_1, V_2) as its order-2 major factor. Our goal of data analysis under this extended version is again reframed as computationally determining whether these order-1 and order-2 major factors are present or not underlying the Re-Co dynamics of Y against the covariate features V_1 and V_2. If the covariate features V_1 and V_2 are independent or only slightly dependent on each other, the right major factor selection protocol can be found in [15]. However, if they are heavily associated, a modified major factor selection protocol can be found in [17]. We conclude Example-1 with a summarizing statement: a large class of statistical topics can be rephrased and reframed into a major factor selection problem, and this problem is resolved commonly by evaluating mutual information estimates that are not required to be precisely close to their unknown theoretical values.
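The determination of order-1 versus order-2 major factors can be sketched on a toy Re-Co dynamic whose response is driven purely by the V_1-V_2 interaction (an illustrative simplification, not the full protocol of [15,17]):

```python
import numpy as np

def mutual_info(y, a):
    """I[Y; A] = H[Y] - H[Y|A] from the empirical contingency table."""
    y, a = np.asarray(y), np.asarray(a)
    def H(labels):
        _, c = np.unique(labels, return_counts=True)
        p = c / c.sum()
        return -np.sum(p * np.log(p))
    H_cond = 0.0
    for val in np.unique(a):           # H[Y|A] = sum_a P(a) H[Y | A = a]
        mask = a == val
        H_cond += mask.mean() * H(y[mask])
    return H(y) - H_cond

rng = np.random.default_rng(0)
N = 5000
V1 = rng.integers(0, 2, N)
V2 = rng.integers(0, 2, N)
noise = rng.random(N) < 0.1
Y = (V1 ^ V2) ^ noise                  # Y driven purely by the interaction

I1 = mutual_info(Y, V1)                # ~0: V1 alone explains nothing
I2 = mutual_info(Y, V2)                # ~0: V2 alone explains nothing
I12 = mutual_info(Y, V1 * 2 + V2)      # large: (V1, V2) is an order-2 major factor
```

The order-2 candidate's conditional-entropy drop far exceeds the sum of its members' order-1 drops, which is the signature of a genuine interacting effect that no per-feature screening could detect.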

[Example-2]: From Dealing with to Lessening the Effects of the Curse of Dimensionality
It is noted here that the mutual information I[Y; V_1] has another representation:

I[Y; V_1] = Σ_v ∫ p(y, v) log [ p(y, v) / (p(y) p(v)) ] dy,

that is, the Kullback-Leibler divergence of the joint distribution of (Y, V_1) from the product of its marginals. This representation is valid even for a categorical variable V_1. Based on this representation, we can clearly see the scale-free property of mutual information with respect to various choices of histograms. Nonetheless, we refrain from using this definition for estimating I[Y; V_1], since the definition-based estimation involves estimating the joint distribution of (Y, V_1), which is a harder problem due to its dimensionality. This so-called curse of dimensionality will become self-evident later in our developments, when the response variable Y and its covariate features (V_1, . . . , V_K) are both multi-dimensional. The task of estimating a multi-dimensional density is neither practical nor reliable given an ensemble of finitely many data points.
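On a contingency table, this representation and the entropy-difference form I[Y; V_1] = H[Y] − H[Y|V_1] agree exactly; a quick check on a small table of hypothetical joint counts:

```python
import numpy as np

# joint counts of (V1, Y): rows are V1 categories, columns are Y categories
table = np.array([[30.0, 10.0], [20.0, 40.0]])
P = table / table.sum()
pr, pc = P.sum(axis=1), P.sum(axis=0)          # marginals of V1 and Y

# (i) KL-type form: sum over cells of p(v, y) log[ p(v, y) / (p(v) p(y)) ]
mi_kl = np.sum(P * np.log(P / np.outer(pr, pc)))

# (ii) entropy-difference form: H[Y] - H[Y|V1]
H_Y = -np.sum(pc * np.log(pc))
H_cond = 0.0
for i, row in enumerate(table):
    p_row = row / row.sum()
    H_cond += pr[i] * (-np.sum(p_row * np.log(p_row)))
mi_diff = H_Y - H_cond
```

The two quantities coincide to machine precision; the practical difference arises only in estimation, where the KL form demands an explicit joint density while the entropy-difference form needs only row-conditional histograms.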
In this subsection, we demonstrate how to effectively deal with the effects of the curse of dimensionality. We consider again a two-sample problem, but one with multi-dimensional data points, not the one-dimensional ones of Example-1. Again we denote the two populations with IDs V_1 = 0 and 1, with data points from the two populations denoted accordingly, and we let Y = (Y_1, . . . , Y_m) denote the multi-dimensional response variable. To resolve the same task of testing whether these two populations are equal, with the m component features possibly highly associated, what would be the best way of building up the contingency table for the purpose of estimating I[Y; V_1] for testing the hypothesis?
We expect that the equal-bin-size and equal-bin-area approaches to component-wise histograms are neither ideal nor practical, due to the curse of dimensionality. On the other hand, we know that clusters of m-dimensional data points naturally retain the dependency structures. Hence, it is intuitive to employ the results of clustering algorithms to differentiate the patterns of structural dependency within Y_0 and Y_1. This intuition leads to the important merit of a cluster-based contingency table as a way of lessening the effects of the curse of dimensionality. We illustrate these ideas through two samples of simulated multivariate Normal data, described as follows.
Let m = 4 and consider two mean-zero Normal distributions N(0, Σ_0) and N(0, Σ_1), where Σ_i (i = 0, 1) has unit variances and common correlation ρ_i on the off-diagonal. The Shannon entropies of these two 4D Normal distributions are computed via the following formula with d = 4:

(1/2) log(det(Σ)) + (d/2)(1 + log(2π)).

Estimating I[Y; V_1] through rigid 4D-hypercube-based histograms, however, turns out to be problematic: through an extra experiment using 100 million data points, we ended up with a negative estimate of the mutual information. This failed attempt in fact provides a vivid clue about the effect of the curse of dimensionality. In other words, we need to resolve such an effect by staying away from rigid 4D hypercubes.
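The entropy formula just quoted can be checked numerically; here we assume the equi-correlation covariance structure suggested by this example's single ρ_i parameters:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(0, Sigma) in nats:
    (1/2) log det(Sigma) + (d/2)(1 + log(2 pi))."""
    d = cov.shape[0]
    return 0.5 * np.log(np.linalg.det(cov)) + 0.5 * d * (1 + np.log(2 * np.pi))

def equicorr(d, rho):
    """Unit variances with common correlation rho off the diagonal."""
    return (1 - rho) * np.eye(d) + rho * np.ones((d, d))

H0 = gaussian_entropy(equicorr(4, 0.5))   # population with rho_0 = 0.5
H1 = gaussian_entropy(equicorr(4, 0.7))   # population with rho_1 = 0.7
# H1 < H0: the more strongly correlated distribution is more concentrated
```

The entropy gap between the two populations is what the cluster-based mutual information evaluations below are ultimately trying to detect.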
In contrast, we demonstrate that cluster-based approaches are potentially reasonable choices to mend this effect of the curse of dimensionality. Consider two commonly used clustering algorithms: Hierarchical clustering (HC) and K-means. It is known that the HC algorithm is computationally more costly than the K-means algorithm: since the HC algorithm heavily relies on a distance matrix, it has difficulties handling a data set with a very large sample size. Recently, very effective computing packages have been developed for the K-means algorithm, so the K-means algorithm can be applied efficiently. On top of the computing efficiency differences, there exists a critical difference between the two algorithms: K-means provides much more even cluster sizes than the HC algorithm does, as illustrated in Figure 1 (see also Figure 2). For these reasons, we employ the K-means clustering algorithm, not Hierarchical clustering (HC), in the following series of cases with m = 2, 3, 4.
In this experiment, we take ρ_0 = 0.5 and ρ_1 = 0.7 under two settings with N = 2000 and N = 20,000. It is noted that the differences in ρ values imply differences in distribution shapes. The series of clustering compositions is constructed as follows. We apply the K-means algorithm to derive clustering compositions with 12, 22, 32 and 102 clusters. Correspondingly, we build a series of contingency tables of the formats: (1) 2 × 12; (2) 2 × 22; (3) 2 × 32; and (4) 2 × 102.

The messages derived from Example-1 are also observed in Example-2 across the 2D to 4D settings in Tables 2-4. These results clearly indicate that distribution shape differences can be effectively and reliably picked up by entropy-based evaluations of the mutual information between Y and the categorical label variable V_1. These results imply that we can widely extend the one-way and two-way ANOVA settings to accommodate high-dimensional data points, as argued in Example-1.

In order to better understand the limits of such an entropy-based approach, we twist the 2D setting of Example-2 a little bit. This more complicated version of Example-2, denoted as Example-2*, consists of one 2D normal mixture and one 2D normal. These two 2D distributions are further made to have equal mean vectors and covariance matrices. Furthermore, two kinds of mixture settings are designed and used. The first setting of Example-2* is designed with a mixture of two relatively close 2D normals with mean vectors (0.5, 0.5) and (−0.5, 0.5). The second setting is designed with a relatively far-apart normal mixture with mean vectors (−1, −1) and (1, 1). The pairwise scatter plots of these two settings are given in Figure 3. It is obvious that we can visually separate the two 2D distributions in the second mixture setting, but cannot do equally well in the first. The mutual information estimates and confidence ranges under the null hypothesis are calculated and reported in Table 5.
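A condensed sketch of the 4D experiment (a hypothetical re-implementation: a small Lloyd's K-means stands in for a production K-means routine, and the null ensemble reuses the Binomial column-splitting device of Example-1):

```python
import numpy as np

def kmeans_labels(X, k, iters=30, seed=0):
    """Lloyd's K-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def table_mi(table):
    """Mutual information of a contingency table, in nats."""
    P = np.asarray(table, float) / np.sum(table)
    pr = P.sum(axis=1, keepdims=True)
    pc = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / (pr @ pc)[nz])))

d, n = 4, 10_000
rng = np.random.default_rng(0)
S0 = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))   # rho_0 = 0.5
S1 = 0.3 * np.eye(d) + 0.7 * np.ones((d, d))   # rho_1 = 0.7
X0 = rng.multivariate_normal(np.zeros(d), S0, n)
X1 = rng.multivariate_normal(np.zeros(d), S1, n)

labels = kmeans_labels(np.vstack([X0, X1]), 12)     # 12 pooled clusters
table = np.vstack([np.bincount(labels[:n], minlength=12),
                   np.bincount(labels[n:], minlength=12)])  # 2 x 12 table
obs = table_mi(table)

# null reference: split each cluster total by Binomial(n_k, 1/2)
cols = table.sum(axis=0)
null = []
for _ in range(300):
    k1 = rng.binomial(cols, 0.5)
    null.append(table_mi(np.vstack([k1, cols - k1])))
upper = np.quantile(null, 0.975)
# obs > upper: the shape difference (rho_0 vs rho_1) is picked up
```

Even though both populations share the same zero mean vector, the cluster memberships capture the shape difference, so the observed mutual information clears the null upper limit.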
In the first mixture setting, it is apparent that V_1 fails to be a major factor, by failing to satisfy the criterion [C1: confirmable] across all K choices. This result is coherent with our visualization in the upper panel of Figure 3. As for the second mixture setting, V_1 is claimed as a major factor by satisfying the [C1: confirmable] criterion across all K choices. This result is likewise coherent with our visualization in the lower panel of Figure 3. Further, we observe that the relative positions of the I[Y; V_1] estimates against the upper and lower limits of the null confidence ranges are rather stable when the cluster sizes are not too small. This observation indeed provides us with a practical guideline for varying the choice of K according to different sample sizes when we employ mutual information to perform inferences under Re-Co dynamics. We conclude Example-2 (and Example-2*) with a summarizing statement: though theoretical evaluations of mutual information in the presence of high dimensionality are practically impossible, clustering algorithms provide practical means of building contingency tables and evaluating mutual information for inferential purposes, by lessening the effects of the curse of dimensionality.

[Example-3]: From Linear to Highly Nonlinear Associations
We then turn to the simplest one-sample problem involving dependent 2D data points. The framework of Re-Co dynamics is self-evident. In this example, we examine the validity and performance of inferences based on the estimated mutual information between two 1D continuous random variables Y and X via contingency tables of various dimensions. For simplicity, in the first scenario of Example-3 we consider a bivariate normal (Y, X) ∼ N(0, Σ) with covariance matrix
Σ = ( 1 ρ; ρ 1 ).
Here the correlation coefficient ρ is taken to be 0.0 and 0.5, respectively, in this experiment with N = 2000 or 20,000. The contingency tables are derived by applying the K-means algorithm to X and Y separately, with a series of pre-determined numbers of clusters: {12, 22, 32, 102}. For the setting of ρ = 0, we report the calculated I[Y; X] and the confidence range of I[Y; ε] in Table 6 across the 16 dimensions of contingency tables. The smallest contingency table has 144 (= 12 × 12) cells; its average cell-count is less than 14 for N = 2000. The largest contingency table is 102 × 102, with more than 10^4 cells; its average cell-count is less than 2 for N = 20,000.
From the upper half of Table 6, for N = 2000, all estimates of I[Y; X] are beyond the upper limit of the 95% confidence range of I[Y; ε]. That is, the hypothesis that Y and X are independent is falsely rejected. In contrast, from the lower half of Table 6, for N = 20,000, all estimates of I[Y; X] are either below the lower limit of the 95% confidence range of I[Y; ε] or within the range, except for the results based on the largest 102 × 102 contingency table. That is, the same independence hypothesis would not be falsely rejected except in the case of the largest contingency table. Such a contrasting comparison between the upper and lower halves of Table 6 clearly indicates that the validity of mutual information evaluations relies heavily on the degree of volatility of the cell counts, especially when testing independence. We make such volatility explicit below.
A simple reasoning for the above results goes as follows. For this independent setting of Y and X, let all cells in the contingency table have equal probability, for expositional simplicity. In the smallest contingency table, the cell probability is 1/144. The cell-count is a random variable with mean and variance both very close to N/144. Thus, the cell-count falls within N/144 ± 2√(N/144) with probability close to 95%. With N = 2000, this 95% range is close to [6, 22], while with N = 20,000 it is close to [115, 162]. Based on these two intervals, we can see that the Shannon entropy along each row of the 12 × 12 contingency table can be volatile with N = 2000, while this is not the case with N = 20,000. In fact, when N = 2000, a 6 × 6 contingency table provides much more stable evaluations of mutual information.
For the setting of ρ = 0.5, we report the calculated I[Y; X] and the confidence range of I[Y; ε] in Table 7 across the 16 dimensions of contingency tables with N = 20,000. We observe that the calculated I[Y; X] is far above the upper limit of the confidence range of I[Y; ε] even in the largest contingency table, with dimension 102 × 102. The reason is that the number of effectively occupied cells is much smaller due to the dependency; that is, many cells that are supposed to be empty are indeed empty. With many empty cells coupled with many occupied cells having relatively large cell counts, the Shannon entropy is evaluated with great stability. The lessons from these independent and dependent experimental cases constitute practical guidelines for evaluating mutual information.
The second scenario of Example-3 concerns whether the calculated mutual information I[Y; X] can reveal the existence of a non-linear association between Y and X. We generate two simulated data sets based on two non-linear associations: (1) a half-sine function; (2) a full-sine function, as shown in the two panels of Figure 4. In both cases of non-linear association, it is noted that the correlation of Y and X is essentially equal to zero.
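The back-of-envelope cell-count reasoning above can be checked directly. The sketch below is our own arithmetic illustration of the mean ± 2 sd approximation under (nearly) equal cell probabilities.

```python
import math

def cell_count_range(n, cells=144):
    """Approximate 95% range of one cell count when all cells are
    (nearly) equally probable: mean +/- 2 sd, with Var ~= mean = n / cells."""
    mean = n / cells
    half = 2 * math.sqrt(mean)
    return mean - half, mean + half

print(cell_count_range(2000))    # roughly (6.4, 21.3): row entropies are volatile
print(cell_count_range(20000))   # roughly (115.3, 162.5): counts are stable
```

A width of the order of the mean itself (as for N = 2000) makes per-row entropy estimates noisy, whereas a width that is a small fraction of the mean (as for N = 20,000) makes them stable.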
In the half-sine setting, we report the calculated I[Y; X] and the confidence range of I[Y; ε] in Table 8; the corresponding results for the full-sine setting are reported in Table 9. Together, these two settings of the non-linear association scenario demonstrate that the calculated I[Y; X] can reveal the existence of a significant association between Y and X. This demonstration is important because it requires no knowledge of the functional form of the association.
We now summarize the practical guidelines learned from Example-1 through Example-3 in this section. The most apparent fact is that the calculated values of the mutual information I[Y; X] vary with the dimensions of the contingency tables. The good news, however, is that the amounts of variation are relatively small, even minute, when the cell-counts in the contingency table are not too low. Moreover, across the three examples and scenarios considered in this section, the calculated mutual information I[Y; X] is very capable of revealing the presence or absence of associations underlying the Re-Co dynamics of a response variable Y and a covariate variable X. Seeking consistent inferential decisions by varying the contingency tables' dimensions is a reliable practice. This capability can be made very efficient if we choose the dimensions of the contingency tables to suitably reflect the total sample size of the data set; that is, we achieve such efficiency by varying the dimensions of the contingency tables from small to reasonably large. The final guideline is that the comparability of two mutual information evaluations rests on their computational platforms being more or less identical, that is, on their contingency tables being more or less the same in dimensions. On the other hand, when the average cell counts are relatively large, mutual information evaluations are rather robust to some degree of difference in the contingency tables' dimensions.
These practical guidelines ascertain that mutual information evaluations are always coupled with reliability. Finally, the data-types of Y and X are entirely unrestricted because we rely only on their categorical nature.
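The half-sine scenario can be reproduced in a few lines. The sketch below is our own illustration, with an assumed N(0, 1)/10 noise term and quantile binning standing in for K-means categorization: the Pearson correlation is essentially zero by symmetry, while I[Y; X] varies only mildly across table dimensions and stays far from zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10000
x = rng.uniform(0, 1, n)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)   # half-sine relation

def qbin(v, k):
    """Quantile binning: a stand-in for a K-means clustering composition."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)

def mi(a, b, ka, kb):
    """I[A; B] in nats from the ka x kb contingency table of labels a and b."""
    t = np.zeros((ka, kb))
    np.add.at(t, (a, b), 1)
    p = t / t.sum()
    pr, pc = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pr @ pc)[m])).sum())

print(abs(np.corrcoef(x, y)[0, 1]))          # near 0: a linear measure sees nothing
for k in (12, 22, 32):                       # I[Y; X] varies mildly with dimension,
    print(k, mi(qbin(y, k), qbin(x, k), k, k))   # but stays clearly positive
```

This mirrors the guideline: comparisons across the three table dimensions are only rough, yet the inferential decision (association present) is consistent across all of them.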

Examples with Complex Re-Co Dynamics
Next, we consider two examples with Re-Co dynamics which are more complex than the three examples discussed in the previous section. Through these two examples, which have independent covariate features, we further illustrate the necessity of following the practical guidelines motivated and learned in the previous section.

[Example-4]: From Complex Interaction to Further Beyond
After going through three relatively simple examples in the previous section, we now turn to examples with more complex Re-Co dynamics. Consider a functional relation between Y and {X1, . . . , X4} specified as follows:
Y = X1 + sin(2π(X2 + X3)) + N(0, 1)/10,
with {X1, . . . , X4} i.i.d. U[0, 1] and N = 10,000. That is, X4 plays the role of an observable noise random variable, while the unobservable noise is N(0, 1)/10. Our goal is to discover the order-1 major factor X1 and the order-2 major factor (X2, X3). It is worth noting that this order-2 major factor cannot be discovered via linear regression analysis, even when a product-type interaction effect is included in the model.
The response variable Y is categorized into 12 bins, as is each of the 4 covariate features. We calculate the mutual information of Y and every possible feature subset A ⊆ {X1, . . . , X4}, denoted I[Y; A]. If |A| = k, we build a (12)^k × 12 contingency table for evaluating I[Y; A]. Here A also stands for a fused categorical variable, in the sense that the categories of A are all occupied kD hypercubes of its k (= |A|) feature-members.
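The fused categorical variable can be constructed mechanically: bin each member feature, then label each occupied kD hypercube. A minimal sketch of ours, with quantile binning as an assumed categorizing scheme:

```python
import numpy as np

def qbin(v, k=12):
    """Categorize a 1D feature into k quantile bins."""
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)

def fuse(binned_cols):
    """Fuse k categorized features into one categorical variable whose
    categories are exactly the occupied kD hypercubes."""
    rows = np.stack(binned_cols, axis=1)
    _, labels = np.unique(rows, axis=0, return_inverse=True)
    return labels.ravel()               # one integer label per data point

rng = np.random.default_rng(0)
n = 10000
x2, x3 = rng.uniform(size=n), rng.uniform(size=n)
a = fuse([qbin(x2), qbin(x3)])          # A = (X2, X3) as one categorical variable
print(a.max() + 1)                      # number of occupied 2D hypercubes (<= 144)
```

Only occupied hypercubes receive labels, so the row dimension of C[A-vs.-Y] never exceeds, and with sparse data is typically far below, (12)^k.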
We compute and report the conditional entropies (CEs) of all possible As and arrange them with respect to the sizes |A| in Table 10. We also report a term called the successive CE-drop (SCE), defined via the following CE difference:
SCE[Y|A] = H[Y|A'] − H[Y|A], with A' ⊂ A.
This SCE term is designed to evaluate the extra CE-drop gained by including an extra feature-member. The above formula is precise in theory. But, reflecting the last practical guideline of the previous section, it is essential to note that SCE[Y|A] involves at least two different settings |A| = k and |A'| = k' (< k), which correspondingly involve two different dimensions of contingency tables: one of (12)^k × 12 and the other of (12)^{k'} × 12. Therefore, based on what we have learned from the previous section, these settings render different scales for conditional entropy and mutual information computations. These different scales will certainly make mutual information evaluations not completely comparable, especially when the cell-counts in the contingency tables are overall too small. For instance, the SCE-drop of (X1, X2) is more than 10 times the CE-drop of X2. It would nevertheless be a mistake to claim that X1 and X2 are conditionally dependent given Y, since the scale in evaluating H[Y|X1] is different from the scale in evaluating H[Y|X1, X2]. Since X4 plays the role of a random noise in this example, the information contents of X1 and (X1, X4) are supposed to be very close from the perspective of their contingency tables. This line of argument ultimately converges to the following practical guideline for evaluating Information Theoretical measurements via the contingency table platform: "CEs and mutual information measurements are comparable only when they are evaluated under the same dimensions of contingency tables". This guideline is indeed coherent with the statistical concept of conditioning with respect to the observed row-sum vector.
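Conditional entropy and the SCE-drop are straightforward to evaluate on the contingency table platform. The sketch below is ours, with quantile binning in place of the paper's categorization scheme; it reproduces the flavor of Example-4: X3 alone shows almost no CE-drop (the sine term averages out marginally), while the pair (X2, X3) produces a large successive drop.

```python
import numpy as np

def qbin(v, k=12):
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)

def fuse(cols):
    _, lab = np.unique(np.stack(cols, axis=1), axis=0, return_inverse=True)
    return lab.ravel()

def cond_entropy(y_lab, a_lab):
    """H[Y|A] (in nats) from the C[A-vs-Y] contingency table."""
    t = np.zeros((a_lab.max() + 1, y_lab.max() + 1))
    np.add.at(t, (a_lab, y_lab), 1)
    p = t / t.sum()
    pa = p.sum(axis=1, keepdims=True)            # row-sum proportions P_A
    m = p > 0
    return float(-(p[m] * np.log(p[m] / np.broadcast_to(pa, p.shape)[m])).sum())

rng = np.random.default_rng(3)
n = 10000
x = rng.uniform(size=(n, 4))
y = x[:, 0] + np.sin(2 * np.pi * (x[:, 1] + x[:, 2])) + rng.normal(scale=0.1, size=n)
yb = qbin(y)
b = [qbin(x[:, j]) for j in range(4)]

h_y = cond_entropy(yb, np.zeros(n, dtype=int))   # H[Y]: a single-category "A"
drop_x3 = h_y - cond_entropy(yb, b[2])           # X3's individual CE-drop
sce_23 = cond_entropy(yb, b[1]) - cond_entropy(yb, fuse([b[1], b[2]]))
print(drop_x3, sce_23)   # the pair's SCE-drop dwarfs X3's individual drop
```

Note that the two terms of sce_23 live on 12 × 12 and 144 × 12 tables, so, per the guideline above, such a raw difference should only be read against baselines built at matching dimensions.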
Before summarizing our findings from Table 10, where the calculated CEs and SCE-drops are reported, we need to prepare baseline evaluations to make sure that all CE comparisons are sensible. Here, we recall that C[A-vs.-Y] denotes the contingency table with the categories of Y on the column-axis and the categories of the covariate feature subset A on the row-axis.
• 1-feature setting: With C[X1-vs.-Y] having its proportion vector of row-sums denoted as P_X1, we build an ensemble of mimicked contingency tables sharing P_X1 to serve as baselines for the CE evaluations.
• 2-feature setting: Such baselines allow us to figure out the amount I[(X1, X2)|Y] − I[(X1, X2)]. As for (X2, X3), in comparison with the SCEs of (X2, X4) and (X3, X4), its SCE-drop is calculated as 0.7781, which is more than 10 times X3's individual SCE-drop. This is a very strong indication of the interacting effect of (X2, X3) due to the evident presence of their conditional dependency given Y. This fact establishes the feature-pair (X2, X3) as an order-2 major factor.
• 3-feature setting: In Table 10, the SCE-drop of the feature-triplet (X1, X2, X3) from the feature-pair (X2, X3) is 0.8431, which is about 3.5 times the CE-drop of X1. This observation could seemingly point to the potential presence of conditional dependency within (X1, X2, X3). However, we can more precisely calculate the effect of X1 when added to (X2, X3) as
H[Y|X2, X3] − H[Y|X1, X2, X3],
and compare it with H[Y|X4, X5, X6] − H[Y|X1, X4, X5, X6], with X5 and X6 being independent random variables; the latter is expected to be larger than 0.2322, but smaller than 0.3901. Therefore, we can only confirm that an ecological effect does exist between X1 and (X2, X3); that is, they can be order-1 and order-2 major factors of Y. But they certainly do not form a conditional dependency underlying Y; see the details of the major factor selection protocol in [15]. Table 10. Experiment with Y = X1 + sin(2π(X2 + X3)) + N(0, 1)/10 and N = 10,000. Each categorized 1-feature has 12 bins, so a k-feature set has (12)^k kD hypercubes.
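The baseline evaluations can be mimicked by permutation: mock covariates X^ε that keep the row-sum proportion vector P_A but are independent of Y. The sketch below is ours, and the paper's exact ensemble construction may differ; it shows X1 sitting far below its null CE ensemble while the pure-noise feature X4 sits inside it.

```python
import numpy as np

def qbin(v, k=12):
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)

def cond_entropy(y_lab, a_lab):
    t = np.zeros((a_lab.max() + 1, y_lab.max() + 1))
    np.add.at(t, (a_lab, y_lab), 1)
    p = t / t.sum()
    pa = p.sum(axis=1, keepdims=True)
    m = p > 0
    return float(-(p[m] * np.log(p[m] / np.broadcast_to(pa, p.shape)[m])).sum())

def null_ces(y_lab, a_lab, n_rep=100, seed=0):
    """H[Y|A^eps] for mock covariates A^eps: permuting A's labels keeps its
    category proportions (the row-sums P_A) but breaks any tie to Y."""
    rng = np.random.default_rng(seed)
    return np.array([cond_entropy(y_lab, rng.permutation(a_lab))
                     for _ in range(n_rep)])

rng = np.random.default_rng(5)
n = 10000
x = rng.uniform(size=(n, 4))
y = x[:, 0] + np.sin(2 * np.pi * (x[:, 1] + x[:, 2])) + rng.normal(scale=0.1, size=n)
yb = qbin(y)

for j, name in [(0, "X1"), (3, "X4")]:
    real = cond_entropy(yb, qbin(x[:, j]))
    ens = null_ces(yb, qbin(x[:, j]))
    print(name, (ens.mean() - real) / ens.std())  # sd's below the null mean
```

Because the real and mock tables share both dimensions and row-sums, the comparison respects the guideline of conditioning on the observed row-sum vector.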

[Example-5]: From High-Order Interaction to Complexity
In order to see the effect of a higher-order major factor, we change the functional form of Y slightly to
Y = X1 + sin(2π(X2 + X3 + X4)) + N(0, 1)/10.
With sample size N = 10,000, our computational results are reported in Table 11. Likewise, we can confirm X1 as an order-1 major factor and the triplet (X2, X3, X4) as an order-3 major factor. In sharp contrast, the evidence for the order-3 major factor seems to disappear when N = 1000, as shown in Table 12. This exactly demonstrates the effect of the finite-sample phenomenon, or curse of dimensionality. Do these two contrasting results, the presence and absence of the order-3 major factor for N = 10,000 and N = 1000, respectively, mean that we should give up looking for high-order major factors in small data sets? Table 11. Experiment with Y = X1 + sin(2π(X2 + X3 + X4)) + N(0, 1)/10 and N = 10,000. Each categorized 1-feature has 12 bins, so a k-feature set has (12)^k kD hypercubes. Table 12. Experiment with Y = X1 + sin(2π(X2 + X3 + X4)) + N(0, 1)/10 and N = 1000. Each categorized 1-feature has 12 bins, so a k-feature set has (12)^k kD hypercubes. The answer to the above question is negative. That is, we can somehow escape from the curse of dimensionality in our pursuit of high-order major factors. Here we demonstrate one way of escaping. We perform K-means clustering on the 3D data points of (X2, X3, X4) with 12, 36, 72 and 144 clusters, from which we build a new covariate feature X234. The CEs of X234 with respect to the four corresponding contingency tables are reported in Table 13, with Y categorized into 12 and 32 categories (clusters) via K-means. In the case of 12 clusters on Y, we see that the CE of X234 moves from 20 to 60 standard deviations (sd) away from the mean CE of X234^ε as the number of clusters of X234 increases from 12 to 144. We observe similar evidence in the case of 32 categories on Y.
We can then confirm X234 as a new order-1 major factor, which is a condensed version of (X2, X3, X4). Therefore, we should also claim that (X2, X3, X4) is indeed an order-3 major factor. This is an important and significant demonstration that we can be sure about the presence of high-order major factors even when the sample size is relatively small; that is, the curse of dimensionality is escapable. Table 13. Exploring the presence of X234 as an order-3 major factor of Y = X1 + sin(2π(X2 + X3 + X4)) + N(0, 1)/10 with N = 1000, with respect to 2 and 4 choices of the numbers of clusters of Y and X234, respectively. The confidence intervals are calculated based on 100 simulations. Further, contrasting Table 13 with Table 12 shows that the biases of mutual information estimates can indeed be managed by reducing the otherwise large number of bins, cells or hypercubes on the covariate side; that is, a small number of clusters can be derived via a clustering approach of choice.
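The escape route can be sketched end-to-end. Below, a plain Lloyd's K-means (our own minimal implementation, standing in for a library routine) condenses the 3D points of (X2, X3, X4) into one categorical feature X234 with a modest number of clusters, whose CE is then compared against a permutation ensemble; 36 clusters are used here for illustration.

```python
import numpy as np

def kmeans_labels(pts, k, iters=25, seed=0):
    """Minimal Lloyd's K-means returning cluster labels."""
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), k, replace=False)].copy()
    for _ in range(iters):
        d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                centers[j] = pts[lab == j].mean(0)
    return lab

def qbin(v, k=12):
    edges = np.quantile(v, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(v, edges)

def cond_entropy(y_lab, a_lab):
    t = np.zeros((a_lab.max() + 1, y_lab.max() + 1))
    np.add.at(t, (a_lab, y_lab), 1)
    p = t / t.sum()
    pa = p.sum(axis=1, keepdims=True)
    m = p > 0
    return float(-(p[m] * np.log(p[m] / np.broadcast_to(pa, p.shape)[m])).sum())

rng = np.random.default_rng(11)
n = 1000                                       # the small-sample setting
x = rng.uniform(size=(n, 4))
y = (x[:, 0] + np.sin(2 * np.pi * (x[:, 1] + x[:, 2] + x[:, 3]))
     + rng.normal(scale=0.1, size=n))
yb = qbin(y)

x234 = kmeans_labels(x[:, 1:4], 36, seed=2)    # condensed covariate X234
real = cond_entropy(yb, x234)
ens = np.array([cond_entropy(yb, rng.permutation(x234)) for _ in range(100)])
print((ens.mean() - real) / ens.std())         # many sd's below the null mean
```

With only 36 rows instead of (12)^3 = 1728 hypercubes, the cell counts are large enough for a stable CE, and the triplet's signal reappears even at N = 1000.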

Examples with Complex Re-Co Dynamics with Dependent Covariate Features
In this section, we conduct one experimental Re-Co dynamics defined by linear structures with slightly dependent covariate features, as specified below. That is, this experiment lies in the classic linear regression domain. However, two twists are included in this experiment. The first twist is that there exist two almost-collinear 3D hyper-planes pertaining to two triplets of covariate features. The second twist is that, when a continuous measurement data-type is altered into a categorical one, we understand that we discard very fine-scale measurement information, often together with some degree of ordinal relational information. Nevertheless, this investment of sacrificing some information in the data is necessary for carrying out our CE computations in their quest for the critical authentic information content contained in the data. On the other hand, it is natural to ask the following question: when linear regression analysis is applied to such a categorized data set, should we naturally expect its conclusions to be close to the true linear structure?
In this section, we investigate the aforementioned two twists in order to understand the general effects of dependence on conditional entropy evaluations, and we also address the above question. The particular focus is placed on issues linked to the validity of Information Theoretical measurements and their reliability evaluations. We would like to demonstrate comparisons between classical statistics and CEDA's major factor selection in their quests into Re-Co dynamics.

[Example-6]: From Dependency Induced Complications to Reality
Consider a Re-Co dynamics defined by linear structures with slightly dependent covariate features:
Y = X1 + X2 + X3 + N(0, 1)/10,
X6 = (X1 + X2 + X3 + X4 + X5 + N(0, 1)/10)/3,
(X1, . . . , X5, X7, . . . , X10) ∼ N(0, Σ),
where Σ is a 9 × 9 covariance matrix (not including X6). Features {X7, X8, X9, X10} play the roles of unrelated, but dependent, noise. The design of Example-6 provides a seemingly dominant order-1 major factor candidate: the feature X6. We want to explore whether we can discover the true structure underlying the Re-Co dynamics, namely the collection of three order-1 major factors {X1, X2, X3}, or not. We would also like to see what realistic computational issues are generated by the dependency among the covariate features.
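Why X6 looks dominant can be seen from a quick simulation. The sketch below is ours and simplifies the paper's Σ to independent standard normals for illustration (the paper's Σ carries slight dependence); it compares marginal correlations with Y.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
# independent standard normals as a simplifying assumption for Sigma
x = {i: rng.normal(size=n) for i in [1, 2, 3, 4, 5, 7, 8, 9, 10]}
y = x[1] + x[2] + x[3] + rng.normal(scale=0.1, size=n)
x[6] = (x[1] + x[2] + x[3] + x[4] + x[5] + rng.normal(scale=0.1, size=n)) / 3

corr = {i: np.corrcoef(y, x[i])[0, 1] for i in [1, 6, 7]}
print(corr)   # X6 carries the largest marginal correlation with Y
```

Since X6 aggregates all three true factors, its marginal correlation with Y exceeds that of any single true factor, even though X6 itself is not part of the generating equation; this is exactly the trap set for marginal screening methods.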
One million 11-dimensional data points are simulated and collected as the data set. We apply our CE computations with all 1D covariate features and the response variable categorized into 22 bins via the same scheme used in the previous section. CEs are calculated for all possible feature-sets via the contingency table platform. For expositional purposes, we only report the CE-values of 10 key characteristic feature-sets across the 1-feature to 6-feature settings in Table 14. The summary of our findings based on major factor selection is reported below.
For instance, for comparison purposes, we perform LASSO regression, specified in the following Lagrangian form:
β̂(λ) = arg min_β ||Y − Xβ||_2^2 + λ||β||_1.
As shown in Figure 5, the joint presence of {X1, X2, X3, X6} is seen for all λ falling within (0, 0.8). Specifically, the observed pattern is that the parameters of the members of {X1, X2, X3} decrease linearly from 1, while the parameter of X6 increases from 0, also linearly. Such linearity is primarily due to the penalty λ. None of these β trajectories is correct for the Re-Co dynamics except at λ = 0, which only reports the result regarding {X1, X2, X3}, but not (X4, X5, X6). Table 14. Example-6 with N = 10^6. Each categorized 1-feature has 22 bins, so a k-feature set has (22)^k kD hypercubes. We conclude that, though the LASSO with its man-made penalty constraint is seemingly coupled with some desirable interpretations, its optimization protocol clearly cannot handle a landscape having two equally probable "deep wells". In sharp contrast, our major factor selection protocol has no problem at all in identifying and confirming the two collections of three order-1 major factors, and in recognizing that these two collections cannot co-exist. This result is reiterated in the next subsection as well. This capability is a chief merit of employing Information Theoretical measures in major factor selection.
Further, we conduct least squares estimation based on the fully categorized data and report the results in Table 15. We can see that the resulting estimates give rise to mixed-up and wrong linear structures. That is, the categorizing scheme, which heterogeneously alters the locations and scales of the original data, has indeed destroyed the data's intrinsic characteristics. From this perspective, we understand that the categorical nature of data is suitable for Information Theoretical measures, but not for linear regression models and their variants.

Escaping from the Curse of Dimensionality
In the 6-feature setting of Example-6, the feature-set (X1, X2, X3, X4, X5, X6) achieves the largest CE among all possible feature-sets, at least 7 times the CE of (X1, X2, X3, X7, X8, X9). Such comparisons are invalid due to the finite-sample phenomenon, or curse of dimensionality, since there are more than 2.49 billion (= (22)^7) cells in the corresponding (22)^6 × 22 contingency tables for just one million data points. How can we escape from the potential effects of the curse of dimensionality on the CE estimations of (X1, X2, X3, X4, X5, X6) and (X1, X2, X3, X7, X8, X9)?
Again, we apply the simple K-means clustering approach. We first apply the K-means algorithm to derive 22 clusters based on the one million 3D data points of (X1, X2, X3), (X4, X5, X6) and (X7, X8, X9), respectively. We denote these three new categorical variables as X123, X456 and X789, respectively. Upon these three new covariate variables, we calculate the CEs (of Y) under the 1-feature and 2-feature settings; see Table 16. We consistently confirm that X123 and X456 are not conditionally dependent given Y. Therefore, the two feature triplets (X1, X2, X3) and (X4, X5, X6) are two separate, chief and alternative collections of three order-1 major factors.

Conclusions
The most fundamental concept underlying all the practical guidelines we have learned from the series of increasingly complex examples in this paper is this: the comparability of evaluations of conditional entropy and mutual information critically rests on the equality of the dimensions of the contingency tables on which these evaluations are carried out. Based on this comparability concept, the focal goal of data analysis is then rephrased in terms of the [C1: confirmable] criterion regarding the presence and absence of major factors underlying a designated Re-Co dynamics. In other words, it is absolutely essential to note that there is no need for precise theoretical information measurements in real data analysis. Such a [C1: confirmable] criterion pertaining to the discovery of major factors subsequently promotes all practical guidelines centered around the task of confirming or debunking an existential collection of major factors of various orders. Since the presence and absence of such an existential collection of major factors indeed manifests the data's authentic information content, the task of data analysis as a whole is translated, from the data's information content perspective, into the single issue of major factor selection.
Furthermore, all practical guidelines on evaluating mutual information, particularly for our major factor selection protocol, are largely devoted to ascertaining the [C1: confirmable] criterion against the effects of the curse of dimensionality, or finite-sample phenomenon. Practically, we learn to be sensitively aware of the dangers of low cell-counts in potentially occupied cells when evaluating entropy measures. We also develop clustering-based approaches to lessen the effects of the curse of dimensionality. Having learned all these practical guidelines, we are confident in applying our major factor selection protocol and related Categorical Exploratory Data Analysis (CEDA) techniques to analyzing real-world structured data sets.
In many scientific fields, such as biology, medicine, psychology and the social sciences, measurements are not always precisely metric. Even within a metric system, a continuous measurement is often grouped and converted into a discrete or even ordinal data format. That is, the very fine-scale details of a data point are often given up because they are too costly to measure, cannot be measured, or need to be discarded for practical computational considerations. Therefore, any structured data set likely consists of some features having incomparable measurement scales and some features having no scales at all. How to analyze such a data set in a coherent fashion is not at all a simple task. CEDA is a data analysis paradigm designed to be coherent with all features' measurements. So, CEDA and its major factor selection protocol are developed to embrace the ideal concept: each single feature must be allowed to contribute its own authentic information locally, and then to congregate and weave patterns that reveal heterogeneity at global, medium and fine scales.
To facilitate and carry out such a fundamental concept of data analysis, CEDA rests exclusively on one simple fact: all data-types are embedded with a categorical nature. Hence, all pieces of local information derived from all categorical or categorized features must be comparable. All these information pieces can then be woven together to reveal multiscale heterogeneity. In this way, no man-made assumptions or structures are needed in CEDA, so the information brought out by CEDA is authentic. That is, we are free from the danger of generating misinformation via data analysis involving unrealistic assumptions or structures.
To achieve the aforementioned goals of CEDA via our major factor selection protocol, we definitely need stable and credible evaluations of the conditional entropy and mutual information underlying any targeted Re-Co dynamics of interest. That is why the practical guidelines learned in this paper are essential and significant. On the other hand, these practical guidelines also reveal the flexibility and capability of CEDA and its major factor selection in helping scientists extract intelligence from their own data sets.
As a final remark, we clearly demonstrate in this paper that, by reframing many key statistical topics within one Re-Co dynamics framework, CEDA and its major factor selection protocol not only resolve the original data analysis tasks, but also, more importantly, shed authentic light on issues related to widely expanded frameworks containing the original statistical topics. This manifests the capacity of CEDA and its major factor selection protocol for truly accommodating and resolving real-world scientific problems.
Finally, we conclude that the learned practical guidelines for evaluating CE and I[Re; Co] allow scientists to effectively carry out CEDA and its major factor selection protocol to extract data's visible and authentic information content, which we take as the ultimate goal of data analysis.