# Accelerating Causal Inference and Feature Selection Methods through G-Test Computation Reuse


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Notes on Information Theory

## 4. The G-Test

## 5. The G Statistic and dcMI

## 6. The Iterative Parent-Child Markov Blanket Algorithm

#### 6.1. IPC–MB and AD–Trees

- eliding AD–nodes for the Most Common Value (MCV) whenever creating the children of a Vary–node; MCV elision is a significant optimization for AD–trees because it prevents the expansion of the subtree under the AD–node with the highest sample count among the children of a Vary–node. This optimization relies on the assumption that a higher sample count requires a larger subtree: a larger subset of the samples is likely to contain a greater variety of values, and a greater variety of values implies that AD–nodes will often contain positive (non-zero) sample counts, even at greater depths of the subtree. To recover the information elided by an unexpanded subtree, the querying process must reconstruct the missing sample counts at runtime, a more complex procedure;
- replacing as many small subtrees as possible at the deeper levels with Leaf-List nodes; Leaf-List nodes prevent the expansion of subtrees when their total sample count drops under a manually given threshold but they keep references to (or copies of) the original samples in the data set. This threshold is named the Leaf-List Threshold (LLT). In case a query needs to descend into a subtree replaced by a Leaf-List node, it will count the referenced raw samples instead, directly from the data set.
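The two rules above can be illustrated with a minimal sketch. The class names, the flattened node layout, and the LLT value are ours for illustration; they are not the structures used by MBTK or by the original AD–tree papers. A subtree whose sample count drops under the LLT is replaced by a node that merely keeps row indices, and the child with the most samples (the MCV) is never expanded at all:

```python
LLT = 5  # Leaf-List Threshold (illustrative value)

class LeafList:
    """Replaces a small subtree; keeps only row indices into the data set."""
    def __init__(self, rows):
        self.rows = rows

class ADNode:
    """Simplified AD-tree node: children are keyed by (column, value)."""
    def __init__(self, dataset, rows, next_col, n_cols):
        self.count = len(rows)
        self.children = {}
        for col in range(next_col, n_cols):
            by_value = {}
            for r in rows:
                by_value.setdefault(dataset[r][col], []).append(r)
            # MCV elision: skip the child with the highest sample count
            mcv = max(by_value, key=lambda v: len(by_value[v]))
            for value, sub in by_value.items():
                if value == mcv:
                    continue                # elided; reconstructed at query time
                if len(sub) < LLT:
                    # Leaf-List rule: keep raw row indices instead of a subtree
                    self.children[(col, value)] = LeafList(sub)
                else:
                    self.children[(col, value)] = ADNode(dataset, sub, col + 1, n_cols)
```

A query descending into a `LeafList` counts the referenced rows directly in the data set, while a query for the elided MCV child subtracts the stored siblings' counts from the parent's count.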

#### 6.2. IPC–MB and dcMI

- the features required for the missing joint entropy term are retrieved from the data set;
- their joint probability mass function is computed;
- their joint entropy is computed;
- the joint entropy is stored in the JHT under a key that uniquely identifies the specific subset of features; and
- the new joint entropy value is now available to compute the G statistic.
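The lookup-or-compute cycle above can be sketched as follows. The class and function names are illustrative rather than MBTK's API; the G statistic is assembled from the identity $G = 2N \cdot I(X;Y \mid \mathit{Z})$, with the conditional mutual information expanded into four joint entropy terms, each of which is served from the JHT when already present:

```python
import math
from collections import Counter

class JointEntropyTable:
    """Memoizes joint entropies, keyed by the subset of feature indices."""

    def __init__(self, dataset):
        self.dataset = dataset   # list of sample tuples (rows)
        self.cache = {}          # frozenset of feature indices -> entropy (nats)

    def joint_entropy(self, columns):
        key = frozenset(columns)
        if key in self.cache:    # reuse a previously computed term
            return self.cache[key]
        cols = sorted(key)
        # joint probability mass function over the selected features
        counts = Counter(tuple(row[c] for c in cols) for row in self.dataset)
        n = len(self.dataset)
        h = -sum((c / n) * math.log(c / n) for c in counts.values())
        self.cache[key] = h      # store under the unique feature-subset key
        return h

def g_statistic(jht, x, y, z):
    """G = 2 * N * I(X; Y | Z), where
    I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    n = len(jht.dataset)
    cmi = (jht.joint_entropy({x} | set(z)) + jht.joint_entropy({y} | set(z))
           - jht.joint_entropy({x, y} | set(z)) - jht.joint_entropy(set(z)))
    return 2 * n * cmi
```

Because CI tests performed by IPC–MB share many feature subsets, most of the four entropy terms of a new test are usually already in the cache, so the data set is not touched at all.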

## 7. A Comparative Experiment

- GDefault, using an unoptimized G-test implementation, which must count raw samples in the data set for every CI test; this is the default configuration of IPC–MB as proposed by Fu and Desmarais [6]; this configuration is only provided as reference, where feasible;
- GStADt, using a G-test implementation which retrieves information from a pre-built static AD–tree instead of reading the data set directly (except when Leaf-List nodes are built, as described in Section 6.1); this configuration cannot run unless the static AD–tree has already been built on the chosen data set; it allows the variation of the LLT parameter (see Section 6.1); it is only provided as reference, where feasible;
- GDyADt, using a G-test implementation which retrieves information from a dynamic AD–tree instead of reading the data set directly; this configuration requires access to the data set, but it does not require a tree-building phase and can start immediately; it allows the variation of the LLT parameter;
- GdcMI, using a G-test implementation which relies on $dcMI$ to calculate CI tests: it maintains a Joint Entropy Table (JHT) and a DoF cache, both described in Section 5, which are consulted to compute the G statistic without accessing the data set at all for most CI tests.

- the GStADt configuration shares the static AD–trees among its runs on each specific data set given an LLT argument, in order to amortize the cost of building it once per data set [6];
- the GDyADt configuration shares the dynamic AD–trees among its runs on each specific data set given an LLT argument, initially empty, but which are expanded as IPC–MB performs G-tests on the features of the data sets;
- the GdcMI configuration shares the JHT instances and DoF caches among its runs on each specific data set.

`multi-user` mode only and shutting down all the networking subsystems, as well as other userspace processes (such as `dbus`, `pulseaudio`, `cups`).

#### 7.1. Implementation

`numpy` [27], `scipy` [28], and `larkparser` [29].

#### 7.2. Data Sets

`bnlearn` R package [30]. A BIF reader was implemented using the `larkparser` Python package [29] to load them into the experimental framework. To generate the actual samples for the data sets, a custom-built Bayesian network sampler was implemented, which produced randomly generated samples on demand while respecting the conditional dependence relationships between the features. This sampler provides the values of all the features of a sample, thus generating full sample vectors. Similar sample generators were found as libraries, but they lacked important features, such as generating a sample that contains the values of all features rather than only the value of a target feature.
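The behavior of such a sampler can be illustrated with a minimal ancestral sampler, which draws each feature only after its parents so that every sample is a full vector respecting the conditional dependencies. The network description format and the example CPT values below are invented for this sketch; they are not the BIF format or the networks used in the experiment:

```python
import random

# Hypothetical network description: feature -> (parents, CPT), where the CPT
# maps a tuple of parent values to a {value: probability} dictionary.
network = {
    "Burglary":   ((), {(): {0: 0.99, 1: 0.01}}),
    "Earthquake": ((), {(): {0: 0.98, 1: 0.02}}),
    "Alarm":      (("Burglary", "Earthquake"),
                   {(0, 0): {0: 0.999, 1: 0.001},
                    (0, 1): {0: 0.71, 1: 0.29},
                    (1, 0): {0: 0.06, 1: 0.94},
                    (1, 1): {0: 0.05, 1: 0.95}}),
}

def topological_order(net):
    """Order the features so that every parent precedes its children."""
    order, seen = [], set()
    def visit(f):
        if f in seen:
            return
        for p in net[f][0]:
            visit(p)
        seen.add(f)
        order.append(f)
    for f in net:
        visit(f)
    return order

def sample(net, rng=random):
    """Ancestral sampling: draw each feature from its CPT row selected by the
    already-drawn parent values, producing a full sample vector."""
    values = {}
    for f in topological_order(net):
        parents, cpt = net[f]
        dist = cpt[tuple(values[p] for p in parents)]
        r, acc = rng.random(), 0.0
        for v, p in dist.items():
            acc += p
            if r <= acc:
                values[f] = v
                break
        else:
            values[f] = v  # guard against floating-point round-off in the CPT row
    return values
```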

`bnlearn` package. The ANDES Bayesian network [32] is a model used to assess long-term student performance, based on data collected by the ANDES tutoring system. This network was categorized as ‘very large’ by the `bnlearn` package.

#### 7.2.1. The ALARM Subexperiment

- 37 runs of $GDefault$
- 111 runs of $GStADt$, further subdivided into 3 groups of 37 runs, one for each of the selected values for the LLT parameter, namely 0, 5, and 10; the static AD–trees are shared only among the runs in each LLT group;
- 111 runs of $GDyADt$, further subdivided into 3 groups of 37 runs, one group for each of the selected values for LLT, namely 0, 5, and 10; the dynamic AD–trees are shared only among the runs in each LLT group;
- 37 runs of $GdcMI$.

- ‘Time’: the total time spent by IPC–MB performing only CI tests; these durations include neither initialization time, nor the time IPC–MB needs to perform its own computation; the time to build the static AD–trees is excluded as well;
- ‘Rate’: the average number of CI tests performed per second;
- ‘Mem’: the total amount of memory consumed by the optimization structures, in megabytes; for AD–tree–based configurations, this is the size of the AD–tree alone; for $GdcMI$, it is the combined size of the JHT and DoF cache.

#### 7.2.2. The ANDES Subexperiment

- 669 runs of $GDyADt$, further subdivided into three groups of 223 runs, one group for each of the selected values for LLT, namely 0, 5, and 10; the dynamic AD–trees are shared only among the runs in each LLT group;
- 223 runs of $GdcMI$.

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Analysis

**Proof.**

**Proof.**

**Proof.**

**Proof.**

## Appendix B. Simple Example of Reusing Joint Entropy Terms

## Appendix C. Simple Example of Building a Dynamic AD–Tree

## Appendix D. Optimizations for AD–Trees

#### Appendix D.1. Eliding the Most Common Value

A: | 0 | 1 | 2 | 1 | 1 | 2 | 2 | 2 |

B: | 1 | 0 | 0 | 2 | 2 | 1 | 1 | 1 |

#### Retrieving Counts
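Using the example data above, the count elided for the MCV can be reconstructed at query time from the counts that are stored. This is a worked illustration, not MBTK code: since the MCV child of a Vary–node is not expanded, its count is the parent's count minus the counts of the stored siblings.

```python
# Example data from above: feature A takes values 0, 1, 2 with counts 1, 3, 4.
A = [0, 1, 2, 1, 1, 2, 2, 2]

# In the AD-tree, the subtree for A = 2 (the MCV) is elided; only the
# counts for A = 0 and A = 1 are stored under the Vary-node for A.
stored = {0: A.count(0), 1: A.count(1)}   # {0: 1, 1: 3}
parent_count = len(A)                      # 8, stored at the parent AD-node

# A query for count(A = 2) reconstructs the elided count at runtime:
mcv_count = parent_count - sum(stored.values())
assert mcv_count == A.count(2)             # 8 - (1 + 3) == 4
```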

#### Appendix D.2. Eliding Small Subtrees at Deeper Levels

## Appendix E. Computing Degrees of Freedom

- retrieve the domains of X, Y and of each of the variables ${Z}_{1},{Z}_{2},\dots ,{Z}_{n}$ from the data set; they are noted with ${V}_{X}$, ${V}_{Y}$ and ${V}_{{Z}_{i}}$, respectively;
- compute the unadjusted degrees of freedom:$$DoF(X,Y\mid \mathit{Z})=\left(\left|{V}_{X}\right|-1\right)\cdot \left(\left|{V}_{Y}\right|-1\right)\cdot \prod _{{Z}_{i}\in \mathit{Z}}\left|{V}_{{Z}_{i}}\right|$$
- Spirtes et al. [12] deduct 1 from $DoF(X,Y\mid \mathit{Z})$ for ‘each cell of the distribution that has a zero entry’; it is not specified what the distribution is, but we assume it is most likely the conditional distribution $\mathbb{P}(X,Y\mid \mathit{Z})$ and not the joint distribution $\mathbb{P}(X,Y,\mathit{Z})$;
- return the adjusted value as the degrees of freedom of the test between X and Y given $\mathit{Z}$.
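A direct transcription of these steps might look as follows; the function is illustrative, not the MBTK implementation, and the zero-cell adjustment is applied to the cells of the conditional distribution, per the assumption stated above:

```python
from collections import Counter
from itertools import product

def dof_spirtes(data, x, y, z_cols):
    """DoF per Spirtes et al.: the theoretical (|V_X|-1)(|V_Y|-1) * prod |V_Zi|,
    minus 1 for every zero cell over the full Cartesian product of domains."""
    vx = sorted({row[x] for row in data})
    vy = sorted({row[y] for row in data})
    vz = [sorted({row[c] for row in data}) for c in z_cols]

    dof = (len(vx) - 1) * (len(vy) - 1)
    for dom in vz:
        dof *= len(dom)

    # adjustment: deduct 1 per zero cell; note that every combination of
    # values must be enumerated, even those absent from the data set
    counts = Counter((row[x], row[y], tuple(row[c] for c in z_cols)) for row in data)
    for xv, yv in product(vx, vy):
        for zv in product(*vz):
            if counts[(xv, yv, zv)] == 0:
                dof -= 1
    return dof
```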

- retrieve the PMFs $\mathbb{P}(X\mid \mathit{Z})$, $\mathbb{P}(Y\mid \mathit{Z})$ and $\mathbb{P}\left(\mathit{Z}\right)$ from the data set;
- remove all entries from these PMFs for which the probability value is 0;
- for each $\mathit{z}\in {V}_{\mathit{Z}}$, determine the domains of $\mathbb{P}(X\mid \mathit{z})$ and $\mathbb{P}(Y\mid \mathit{z})$, noted with ${V}_{X\mid \mathit{z}}$ and ${V}_{Y\mid \mathit{z}}$ respectively;
- compute the number of cells in each contingency table of X and Y given $\mathit{z}$, with one row and one column deducted from each table, resulting in the degrees of freedom of that table; these are then summed together:$$DoF(X,Y\mid \mathit{Z})=\sum _{\mathit{z}\in {V}_{\mathit{Z}}}\left(\left|{V}_{X\mid \mathit{z}}\right|-1\right)\cdot \left(\left|{V}_{Y\mid \mathit{z}}\right|-1\right)$$
- return this sum as the DoF of the test between X and Y given $\mathit{Z}$.
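These steps can be sketched as follows (an illustrative function, not the MBTK implementation); because the per-$\mathit{z}$ domains are taken from the data itself, zero-probability entries never enter the sum:

```python
from collections import defaultdict

def dof_from_pmfs(data, x, y, z_cols):
    """DoF from the observed conditional PMFs: each observed z contributes
    (|V_{X|z}| - 1) * (|V_{Y|z}| - 1), i.e. the DoF of its contingency table."""
    x_given_z = defaultdict(set)   # observed values of X for each z
    y_given_z = defaultdict(set)   # observed values of Y for each z
    for row in data:
        zv = tuple(row[c] for c in z_cols)
        x_given_z[zv].add(row[x])
        y_given_z[zv].add(row[y])

    # sum the per-table DoF over the z values that actually occur
    return sum((len(x_given_z[zv]) - 1) * (len(y_given_z[zv]) - 1)
               for zv in x_given_z)
```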

- it does not need to query the distribution $\mathbb{P}(X,Y\mid \mathit{Z})$ for all possible values in order to find the zero values; instead, the domains of the conditional distributions $\mathbb{P}(X\mid \mathit{z})$ and $\mathbb{P}(Y\mid \mathit{z})$ are considered; this saves time because these distributions are much smaller and their domains are smaller than the theoretical domain resulting from the Cartesian product of the domains of the variables X, Y, and $\mathit{Z}$ (they only contain what the data set contains); by contrast, the method proposed by Spirtes et al. [12] operates on the full domains of the variables, even though some combinations of values never appear in the data set; as a consequence of this minor difference, fewer adjustments need to be made to the computed DoF;
- it uses the conditional probabilities already retrieved from the data set for the G-test, thus it does not require the larger $\mathbb{P}(X,Y\mid \mathit{Z})$ to be retrieved separately.

## References

1. Pearl, J. Causal inference in statistics: An overview. Stat. Surv. **2009**, 3, 96–146.
2. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Revised Second Printing; Kaufmann: San Francisco, CA, USA, 2008.
3. Margaritis, D.; Thrun, S. Bayesian network induction via local neighborhoods. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 505–511.
4. Fu, S.; Desmarais, M.C. Markov blanket based feature selection: A review of past decade. In Proceedings of the World Congress on Engineering; Newswood Ltd.: Hong Kong, China, 2010; Volume 1, pp. 321–328.
5. Pena, J.M.; Nilsson, R.; Björkegren, J.; Tegnér, J. Towards scalable and data efficient learning of Markov boundaries. Int. J. Approx. Reason. **2007**, 45, 211–232.
6. Fu, S.; Desmarais, M.C. Fast Markov blanket discovery algorithm via local learning within single pass. In Proceedings of the Conference of the Canadian Society for Computational Studies of Intelligence, Windsor, ON, Canada, 28–30 May 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 96–107.
7. Tsamardinos, I.; Aliferis, C.F.; Statnikov, A. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 673–678.
8. Băncioiu, C.; Vintan, M.; Vinţan, L. Efficiency Optimizations for Koller and Sahami’s feature selection algorithm. Rom. J. Inf. Sci. Technol. **2019**, 22, 85–99.
9. Koller, D.; Sahami, M. Toward optimal feature selection. In Proceedings of the 13th International Conference on Machine Learning, Bari, Italy, 3–6 July 1996; pp. 284–292.
10. Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. **2012**, 13, 27–66.
11. Tsamardinos, I.; Aliferis, C.; Statnikov, A.; Statnikov, E. Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the 16th International FLAIRS Conference, St. Augustine, FL, USA, 12–14 May 2003; AAAI Press: Palo Alto, CA, USA, 2003; pp. 376–380.
12. Spirtes, P.; Glymour, C.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000.
13. Aliferis, C.F.; Tsamardinos, I.; Statnikov, A. HITON: A novel Markov Blanket algorithm for optimal variable selection. AMIA Annu. Symp. Proc. **2003**, 2003, 21–25.
14. Moore, A.; Lee, M.S. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Intell. Res. **1998**, 8, 67–91.
15. Komarek, P.; Moore, A.W. A Dynamic Adaptation of AD–trees for Efficient Machine Learning on Large Data Sets. In Proceedings of the Seventeenth International Conference on Machine Learning; ICML ’00; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 495–502.
16. Moraleda, J.; Miller, T. AD+Tree: A Compact Adaptation of Dynamic AD–Trees for Efficient Machine Learning on Large Data Sets. In Intelligent Data Engineering and Automated Learning; Liu, J., Cheung, Y., Yin, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 313–320.
17. Tsamardinos, I.; Borboudakis, G.; Katsogridakis, P.; Pratikakis, P.; Christophides, V. A greedy feature selection algorithm for Big Data of high dimensionality. Mach. Learn. **2019**, 108, 149–202.
18. Tsagris, M. Conditional independence test for categorical data using Poisson log-linear model. arXiv **2017**, arXiv:1706.02046.
19. Agresti, A. Categorical Data Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2003; Volume 482.
20. Lehmann, E.L.; Romano, J.P. Testing Statistical Hypotheses, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2005.
21. Al-Labadi, L.; Fazeli Asl, F.; Saberi, Z. A test for independence via Bayesian nonparametric estimation of mutual information. Can. J. Stat. **2021**.
22. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
23. McDonald, J.H. Handbook of Biological Statistics, 3rd ed.; Sparky House Publishing: Baltimore, MD, USA, 2014; pp. 53–58.
24. Everitt, B. The Cambridge Dictionary of Statistics; Cambridge University Press: Cambridge, UK, 2002.
25. Lamont, A. What Exactly Are Degrees of Freedom?: A Tool for Graduate Students in the Social Sciences; University of South Carolina: Columbia, SC, USA, 2015.
26. Băncioiu, C. MBTK, a Library for Studying Markov Boundary Algorithms. 2020. Available online: https://github.com/camilbancioiu/mbtk (accessed on 10 September 2021).
27. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature **2020**, 585, 357–362.
28. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods **2020**, 17, 261–272.
29. Lark—A Parsing Toolkit for Python. Available online: https://github.com/lark-parser/lark (accessed on 3 May 2021).
30. Scutari, M. bnlearn—An R Package for Bayesian Network Learning and Inference. Available online: https://www.bnlearn.com/ (accessed on 6 June 2020).
31. Beinlich, I.A.; Suermondt, H.J.; Chavez, R.M.; Cooper, G.F. The ALARM Monitoring System: A Case Study with two Probabilistic Inference Techniques for Belief Networks. In AIME 89; Hunter, J., Cookson, J., Wyatt, J., Eds.; Springer: Berlin/Heidelberg, Germany, 1989; pp. 247–256.
32. Conati, C.; Gertner, A.S.; VanLehn, K.; Druzdzel, M.J. On-Line Student Modeling for Coached Problem Solving Using Bayesian Networks. In User Modeling; Jameson, A., Paris, C., Tasso, C., Eds.; Springer: Vienna, Austria, 1997; pp. 231–242.

| LLT | 4000 Samples | 8000 Samples | 16,000 Samples |
|---|---|---|---|
| 0 | 13.7 | 31.1 | 67.3 |
| 5 | 0.9 | 1.7 | 3.0 |
| 10 | 0.5 | 1.1 | 2.1 |

**Table 2.** Results of the ALARM subexperiment for 4000, 8000 and 16,000 samples. The table header also shows the number of CI tests performed in each case. $L00$, $L05$, and $L10$ are abbreviations of LLT 0, LLT 5, and LLT 10, respectively.

The three column groups correspond, from left to right, to 4000 samples (26,430 CI tests), 8000 samples (33,183 CI tests), and 16,000 samples (34,291 CI tests).

| Configuration | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) |
|---|---|---|---|---|---|---|---|---|---|
| GDefault | 161 | 163.7 | — | 389 | 85.2 | — | 786 | 43.7 | — |
| GStADt L00 | 24 | 1074.9 | 516.1 | 34 | 948.4 | 1123.4 | 38 | 894.0 | 2452.1 |
| GStADt L05 | 57 | 459.2 | 34.1 | 95 | 348.5 | 54.8 | 123 | 278.4 | 87.4 |
| GStADt L10 | 63 | 415.5 | 19.5 | 105 | 314.6 | 34.5 | 140 | 243.6 | 58.9 |
| GDyADt L00 | 25 | 1044.4 | 14.6 | 37 | 892.5 | 28.6 | 40 | 838.5 | 49.6 |
| GDyADt L05 | 58 | 453.7 | 12.6 | 96 | 345.4 | 25.3 | 127 | 268.5 | 45.2 |
| GDyADt L10 | 66 | 400.0 | 11.1 | 106 | 310.9 | 23.0 | 143 | 239.0 | 41.7 |
| GdcMI | 11 | 2522.7 | 4.2 | 18 | 1815.4 | 4.6 | 32 | 1063.2 | 4.8 |

**Table 3.** Results of the ANDES subexperiment for 4000, 8000, and 16,000 samples. The table header also shows the number of CI tests performed in each case. $L00$, $L05$, and $L10$ are abbreviations of LLT 0, LLT 5, and LLT 10, respectively.

The three column groups correspond, from left to right, to 4000 samples (1,155,034 CI tests), 8000 samples (2,841,520 CI tests), and 16,000 samples (4,830,900 CI tests).

| Configuration | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) | Time (s) | Rate (s${}^{-\mathbf{1}}$) | Mem (MB) |
|---|---|---|---|---|---|---|---|---|---|
| GDyADt L00 | 6001 | 192.4 | 567.8 | 30,462 | 93.2 | 1287.2 | 123,680 | 39.0 | 2718.1 |
| GDyADt L05 | 8783 | 131.5 | 563.3 | 50,171 | 56.6 | 1269.0 | 261,236 | 18.5 | 2677.9 |
| GDyADt L10 | 10,398 | 111.0 | 559.0 | 64,255 | 44.2 | 1257.8 | 346,235 | 13.9 | 2653.5 |
| GdcMI | 481 | 2401.1 | 166.1 | 1665 | 1706.4 | 369.2 | 5752 | 839.7 | 731.0 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Băncioiu, C.; Brad, R.
Accelerating Causal Inference and Feature Selection Methods through G-Test Computation Reuse. *Entropy* **2021**, *23*, 1501.
https://doi.org/10.3390/e23111501
