# Topological Information Data Analysis


## Abstract


“When you use the word information, you should rather use the word form” – René Thom

**Contents**

- 1. Introduction
  - 1.1. Information Decompositions and Multivariate Statistical Dependencies
  - 1.2. The Approach by Information Topology
- 2. Theory: Homological Nature of Entropy and Information Functions
- 3. Results
  - 3.1. Entropy and Mutual-Information Decompositions
  - 3.2. The Independence Criterion
  - 3.3. Information Coordinates
  - 3.4. Mutual-Information Negativity and Links
- 4. Experimental Validation: Unsupervised Classification of Cell Types and Gene Modules
  - 4.1. Gene Expression Dataset
  - 4.2. ${I}_{k}$ Positivity and General Correlations, Negativity and Clusters
  - 4.3. Cell Type Classification
    - 4.3.1. Example of Cell Type Classification with a Low Sample Size m = 41, Dimension n = 20, and Graining N = 9
    - 4.3.2. Total Correlations (Multi-Information) vs. Mutual-Information
- 5. Discussion
  - 5.1. Topological and Statistical Information Decompositions
  - 5.2. Mutual-Information Positivity and Negativity
  - 5.3. Total Correlations (Multi-Information)
  - 5.4. Beyond Pairwise Statistical Dependences: Combinatorial Information Storage
- 6. Materials and Methods
  - 6.1. The Dataset: Quantified Genetic Expression in Two Cell Types
  - 6.2. Probability Estimation
  - 6.3. Computation of k-Entropy, k-Information Landscapes and Paths
  - 6.4. Estimation of the Undersampling Dimension
    - 6.4.1. Statistical Result
    - 6.4.2. Computational Result
  - 6.5. k-Dependence Test
  - 6.6. Sampling Size and Graining Landscapes—Stability of Minimum Energy Complex Estimation
- Appendix A. Bayes Free Energy and Information Quantities
  - A.1. Parametric Modelling
  - A.2. Bethe Approximation
- References

## 1. Introduction

#### 1.1. Information Decompositions and Multivariate Statistical Dependencies

#### 1.2. The Approach by Information Topology

## 2. Theory: Homological Nature of Entropy and Information Functions

**Remark 1.**

**Remark 2.**

**Proposition 1.**

**Remark 3.**

## 3. Results

#### 3.1. Entropy and Mutual-Information Decompositions

**Theorem 1.**

**Corollary 1.**

**Remark 4.**

**Proposition 2.**

**Proposition 3.**

**Remark 5.**

**Definition 1.**

**Remark 6.**

**Proposition 4.**

**Proof.**

**Proposition 5.**

**Proposition 6.**

**Proof.**

**Corollary 2.**

**Proof.**

**Proposition 7.**

#### 3.2. The Independence Criterion

**Remark 7.**

**Theorem 2.**

**Proof.**

#### 3.3. Information Coordinates

**Lemma 1.**

**Proof.**

**Theorem 3.**

**Proof.**

**Theorem 4.**

**Proof.**

**Remark 8.**

**Remark 9.**

**Remark 10.**

#### 3.4. Mutual-Information Negativity and Links

- (1) when two variables are independent, the information of the three is negative or zero;
- (2) when two variables are conditionally independent with respect to the third, the information of the three is positive or zero.
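These two sign conditions can be checked directly on toy distributions, using the alternating-sum convention ${I}_{3}=\sum_i H({X}_{i})-\sum_{i<j} H({X}_{i},{X}_{j})+H({X}_{1},{X}_{2},{X}_{3})$. A minimal NumPy sketch (function names are illustrative, not from the authors' code):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I3(p):
    """Alternating-sum 3-variable mutual information for a joint law p[x, y, z]:
    marginal entropies - pairwise entropies + triple joint entropy."""
    axes = (0, 1, 2)
    h1 = sum(entropy(p.sum(axis=tuple(a for a in axes if a != i))) for i in axes)
    h2 = sum(entropy(p.sum(axis=i)) for i in axes)  # marginalize one axis out
    h3 = entropy(p)
    return h1 - h2 + h3

# Two pairwise-independent fair bits with Z = X xor Y: synergy, I3 = -1.
p_xor = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p_xor[x, y, x ^ y] = 0.25

# Three identical copies of one fair bit (conditionally redundant): I3 = +1.
p_copy = np.zeros((2, 2, 2))
p_copy[0, 0, 0] = p_copy[1, 1, 1] = 0.5
```

The xor configuration realizes case (1) (X and Y independent, negative ${I}_{3}$), and the triple copy realizes case (2) (positive ${I}_{3}$).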

**Proposition 8.**

**Proof.**

**Proposition 9.**

**Proof.**

## 4. Experimental Validation: Unsupervised Classification of Cell Types and Gene Modules

#### 4.1. Gene Expression Dataset

- The analysis with genes as variables: in this case, the “Heat maps” correspond to $(m,n)$ matrices D (presented in the Section 6.2) together with the labels (population A or population B) of the cells. The data analysis consists of the unsupervised classification of gene modules (presented in Section 4.2).
- The analysis with cells (neurons) as variables: in this case, the “Heat maps” correspond to the transposed matrices ${D}^{T}$ (presented in Section 4.3.1) together with the labels (population A or population B) of the cells. The data analysis consists of the unsupervised classification of cell types.

#### 4.2. ${I}_{k}$ Positivity and General Correlations, Negativity and Clusters

#### 4.3. Cell Type Classification

#### 4.3.1. Example of Cell Type Classification with a Low Sample Size $m=41$, Dimension $n=20$, and Graining $N=9$.

#### 4.3.2. Total Correlations (Multi-Information) vs. Mutual-Information

**d** quantifies this departure from linearity. However, the ${G}_{k}$ landscape fails to distinguish population A as clearly as the ${I}_{k}$ landscape does.

## 5. Discussion

#### 5.1. Topological and Statistical Information Decompositions

#### 5.2. Mutual-Information Positivity and Negativity

#### 5.3. Total Correlations (Multi-Information)

#### 5.4. Beyond Pairwise Statistical Dependences: Combinatorial Information Storage

## 6. Materials and Methods

#### 6.1. The Dataset: Quantified Genetic Expression in Two Cell Types

#### 6.2. Probability Estimation

#### 6.3. Computation of k-Entropy, k-Information Landscapes and Paths

#### 6.4. Estimation of the Undersampling Dimension

#### 6.4.1. Statistical Result

**Lemma 2.**

**Proof.**

#### 6.4.2. Computational Result

- When ${N}_{1}={N}_{2}=\dots={N}_{n}=1$, there is a single box $\Omega$, $P(\Omega )=m/m=1$, and ${H}_{k}={I}_{k}=0$ for all $k\in \{0,\dots,n\}$. The case where $m=1$ is identical. This fixes the lower bound of the analysis: to be non-trivial, it requires $m\ge 2$ and ${N}_{1}={N}_{2}=\dots={N}_{n}\ge 2$.
- When ${N}_{1},{N}_{2},\dots,{N}_{n}$ are such that at most one data point falls into each box, $m$ of the atomic probabilities equal $1/m$ and the remaining ${N}_{1}{N}_{2}\cdots {N}_{n}-m$ are null as a consequence of Equation (71); hence ${H}_{n}={\log}_{2}m$.
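Both boundary cases can be verified numerically on box-count vectors; a small sketch (the helper name is illustrative):

```python
import numpy as np

def joint_entropy(counts):
    """Shannon entropy (bits) of the empirical law P = counts / m,
    where counts[i] is the number of data points in box i."""
    c = np.asarray(counts, dtype=float)
    p = c[c > 0] / c.sum()
    return float(-np.sum(p * np.log2(p)))

m = 8
# N_1 = ... = N_n = 1: all m points fall in the single box Omega, so H = 0.
single_box = [m]
# Graining fine enough that each point sits alone in its box: H_n = log2(m).
one_point_per_box = [1] * m
```

With $m=8$, `joint_entropy(single_box)` gives 0 and `joint_entropy(one_point_per_box)` gives ${\log}_{2}8=3$ bits.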

#### 6.5. k-Dependence Test

#### 6.6. Sampling Size and Graining Landscapes—Stability of Minimum Energy Complex Estimation

## Author Contributions

## Acknowledgments

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| iid | independent identically distributed |
| DA | Dopaminergic neuron |
| nDA | non-Dopaminergic neuron |
| ${H}_{k}$ | Multivariate k-joint Entropy |
| ${I}_{k}$ | Multivariate k-Mutual-Information |
| ${G}_{k}$ | Multivariate k-Total-Correlation or k-Multi-Information |
| $M{I}_{2}C$ | Maximal 2-Mutual-Information Coefficient |

## Appendix A. Bayes Free Energy and Information Quantities

#### Appendix A.1. Parametric Modelling

#### Appendix A.2. Bethe Approximation

## References

- Baudot, P.; Bennequin, D. The Homological Nature of Entropy. Entropy **2015**, 17, 3253–3318.
- Vigneaux, J. The structure of information: From probability to homology. arXiv **2017**, arXiv:1709.07807.
- Vigneaux, J.P. Topology of Statistical Systems. A Cohomological Approach to Information Theory. Ph.D. Thesis, Paris 7 Diderot University, Paris, France, 2019.
- Tapia, M.; Baudot, P.; Formizano-Treziny, C.; Dufour, M.; Temporal, S.; Lasserre, M.; Marqueze-Pouey, B.; Gabert, J.; Kobayashi, K.; Goaillard, J.-M. Neurotransmitter identity and electrophysiological phenotype are genetically coupled in midbrain dopaminergic neurons. Sci. Rep. **2018**, 8, 13637.
- Gibbs, J. Elementary Principles in Statistical Mechanics; Dover Edition (1960 Reprint); Charles Scribner’s Sons: New York, NY, USA, 1902.
- Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
- Shannon, C. A lattice theory of information. Trans. IRE Prof. Group Inform. Theory **1953**, 1, 105–107.
- McGill, W. Multivariate information transmission. Psychometrika **1954**, 19, 97–116.
- Fano, R. Transmission of Information: A Statistical Theory of Communication; MIT Press: Cambridge, MA, USA, 1961.
- Hu, K.T. On the Amount of Information. Theory Probab. Appl. **1962**, 7, 439–447.
- Han, T.S. Linear dependence structure of the entropy space. Inf. Control **1975**, 29, 337–368.
- Han, T.S. Nonnegative entropy measures of multivariate symmetric correlations. IEEE Inf. Control **1978**, 36, 133–156.
- Matsuda, H. Information theoretic characterization of frustrated systems. Phys. Stat. Mech. Its Appl. **2001**, 294, 180–190.
- Bell, A. The co-information lattice. In Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, 1–4 April 2003.
- Brenner, N.; Strong, S.; Koberle, R.; Bialek, W. Synergy in a Neural Code. Neural Comput. **2000**, 12, 1531–1552.
- Watkinson, J.; Liang, K.; Wang, X.; Zheng, T.; Anastassiou, D. Inference of Regulatory Gene Interactions from Expression Data Using Three-Way Mutual Information. Chall. Syst. Biol. Ann. N. Y. Acad. Sci. **2009**, 1158, 302–313.
- Kim, H.; Watkinson, J.; Varadan, V.; Anastassiou, D. Multi-cancer computational analysis reveals invasion-associated variant of desmoplastic reaction involving INHBA, THBS2 and COL11A1. BMC Med. Genom. **2010**, 3, 51.
- Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. **1960**, 4, 66–81.
- Tononi, G.; Edelman, G. Consciousness and Complexity. Science **1998**, 282, 1846–1851.
- Tononi, G.; Edelman, G.; Sporns, O. Complexity and coherency: Integrating information in the brain. Trends Cogn. Sci. **1998**, 2, 474–484.
- Studeny, M.; Vejnarova, J. The multiinformation function as a tool for measuring stochastic dependence. In Learning in Graphical Models; Jordan, M.I., Ed.; MIT Press: Cambridge, UK, 1999; pp. 261–296.
- Schneidman, E.; Bialek, W.; Berry, M.N. Synergy, redundancy, and independence in population codes. J. Neurosci. **2003**, 23, 11539–11553.
- Slonim, N.; Atwal, G.; Tkacik, G.; Bialek, W. Information-based clustering. Proc. Natl. Acad. Sci. USA **2005**, 102, 18297–18302.
- Brenner, N.; Bialek, W.; de Ruyter van Steveninck, R. Adaptive Rescaling Maximizes Information Transmission. Neuron **2000**, 26, 695–702.
- Laughlin, S. A simple coding procedure enhances the neuron’s information capacity. Z. Naturforsch. **1981**, 36, 910–912.
- Margolin, A.; Wang, K.; Califano, A.; Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. **2010**, 4, 428–440.
- Williams, P.; Beer, R. Nonnegative Decomposition of Multivariate Information. arXiv **2010**, arXiv:1004.2515v1.
- Olbrich, E.; Bertschinger, N.; Rauh, J. Information Decomposition and Synergy. Entropy **2015**, 17, 3501–3517.
- Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy **2014**, 16, 2161–2183.
- Griffith, V.; Koch, C. Quantifying Synergistic Mutual Information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190.
- Wibral, M.; Finn, C.; Wollstadt, P.; Lizier, J.; Priesemann, V. Quantifying Information Modification in Developing Neural Networks via Partial Information Decomposition. Entropy **2017**, 19, 494.
- Kay, J.; Ince, R.; Dering, B.; Phillips, W. Partial and Entropic Information Decompositions of a Neuronal Modulatory Interaction. Entropy **2017**, 19, 560.
- Rauh, J.; Bertschinger, N.; Olbrich, E.; Jost, J. Reconsidering unique information: Towards a multivariate information decomposition. In Proceedings of the IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014.
- Abdallah, S.A.; Plumbley, M.D. Predictive Information, Multiinformation and Binding Information; Technical Report; Queen Mary, University of London: London, UK, 2010.
- Valverde-Albacete, F.; Pelaez-Moreno, C. Assessing Information Transmission in Data Transformations with the Channel Multivariate Entropy Triangle. Entropy **2018**, 20, 498.
- Valverde-Albacete, F.; Pelaez-Moreno, C. The evaluation of data sources using multivariate entropy tools. Expert Syst. Appl. **2017**, 78, 145–157.
- Baudot, P. The Poincaré-Boltzmann Machine: From Statistical Physics to Machine Learning and back. arXiv **2019**, arXiv:1907.06486.
- Khinchin, A. Mathematical Foundations of Information Theory; Translated by R.A. Silverman and M.D. Friedman from two Russian articles in Uspekhi Matematicheskikh Nauk, 7 (1953): 3–20 and 9 (1956): 17–75; Dover: New York, NY, USA, 1957.
- Artin, M.; Grothendieck, A.; Verdier, J. Théorie des Topos et Cohomologie Étale des Schémas (SGA 4), Vols. I–III; Séminaire de Géométrie Algébrique du Bois Marie 1963–1964; Lecture Notes in Mathematics; Springer: New York, NY, USA, 1972.
- Rota, G. On the Foundations of Combinatorial Theory I. Theory of Möbius Functions. Z. Wahrscheinlichkeitstheorie **1964**, 2, 340–368.
- Cover, T.; Thomas, J. Elements of Information Theory; Wiley Series in Telecommunication; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 1991.
- Kellerer, H. Masstheoretische Marginalprobleme. Math. Ann. **1964**, 153, 168–198.
- Matus, F. Discrete marginal problem for complex measures. Kybernetika **1988**, 24, 39–46.
- Reshef, D.; Reshef, Y.; Finucane, H.; Grossman, S.; McVean, G.; Turnbaugh, P.; Lander, E.; Mitzenmacher, M.; Sabeti, P. Detecting Novel Associations in Large Data Sets. Science **2011**, 334, 1518.
- Tapia, M.; Baudot, P.; Dufour, M.; Formizano-Treziny, C.; Temporal, S.; Lasserre, M.; Kobayashi, K.; Goaillard, J.M. Information topology of gene expression profile in dopaminergic neurons. bioRxiv **2017**, 168740.
- Dawkins, R. The Selfish Gene, 1st ed.; Oxford University Press: Oxford, UK, 1976.
- Pethel, S.; Hahs, D. Exact Test of Independence Using Mutual Information. Entropy **2014**, 16, 2839–2849.
- Schreiber, T. Measuring Information Transfer. Phys. Rev. Lett. **2000**, 85, 461–464.
- Barnett, L.; Barrett, A.; Seth, A.K. Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables. Phys. Rev. Lett. **2009**, 103, 238701.
- Kolmogorov, A.N. Grundbegriffe der Wahrscheinlichkeitsrechnung; Springer: Berlin, Germany, 1933; English translation (1950): Foundations of the Theory of Probability; Chelsea: New York, NY, USA.
- Loday, J.-L.; Vallette, B. Algebraic Operads; Springer: Berlin/Heidelberg, Germany, 2012.
- Tkacik, G.; Marre, O.; Amodei, D.; Schneidman, E.; Bialek, W.; Berry, M.J., II. Searching for collective behavior in a large network of sensory neurons. PLoS Comput. Biol. **2014**, 10, e1003408.
- Schneidman, E.; Berry, M., II; Segev, R.; Bialek, W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature **2006**, 440, 1007–1012.
- Merchan, L.; Nemenman, I. On the Sufficiency of Pairwise Interactions in Maximum Entropy Models of Networks. J. Stat. Phys. **2016**, 162, 1294–1308.
- Humplik, J.; Tkacik, G. Probabilistic models for neural populations that naturally capture global coupling and criticality. PLoS Comput. Biol. **2017**, 13, e1005763.
- Atick, J. Could information theory provide an ecological theory of sensory processing? Netw. Comput. Neural Syst. **1992**, 3, 213–251.
- Baudot, P. Natural Computation: Much ado about Nothing? An Intracellular Study of Visual Coding in Natural Condition. Master’s Thesis, Paris 6 University, Paris, France, 2006.
- Yedidia, J.; Freeman, W.; Weiss, Y. Understanding belief propagation and its generalizations. Distinguished Lecture, Int. Joint Conf. Artif. Intell. **2001**, 8, 236–239.
- Reimann, M.; Nolte, M.; Scolamiero, M.; Turner, K.; Perin, R.; Chindemi, G.; Dłotko, P.; Levi, R.; Hess, K.; Markram, H. Cliques of Neurons Bound into Cavities Provide a Missing Link between Structure and Function. Front. Comput. Neurosci. **2017**, 12, 48.
- Gibbs, J. A Method of Geometrical Representation of the Thermodynamic Properties of Substances by Means of Surfaces. Trans. Conn. Acad. **1873**, 2, 382–404.
- Landauer, R. Irreversibility and heat generation in the computing process. IBM J. Res. Dev. **1961**, 5, 183–191.
- Shipman, J. Tkinter Reference: A GUI for Python; New Mexico Tech Computer Center: Socorro, NM, USA, 2010.
- Hunter, J. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. **2007**, 9, 22–30.
- Van Der Walt, S.; Colbert, C.; Varoquaux, G. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng. **2011**, 13, 22–30.
- Hagberg, A.; Schult, D.; Swart, P. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA, USA, 19–24 August 2008; Varoquaux, G., Vaught, T., Millman, J., Eds.; pp. 11–15.
- Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory **1968**, 14, 55–63.
- Strong, S.; de Ruyter van Steveninck, R.; Bialek, W.; Koberle, R. On the application of information theory to neural spike trains. Pac. Symp. Biocomput. **1998**, 621–632.
- Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E **2004**, 69, 056111.
- Borel, E. La mécanique statistique et l’irréversibilité. J. Phys. Theor. Appl. **1913**, 3, 189–196.
- Scott, D. Multivariate Density Estimation: Theory, Practice and Visualization; Wiley: New York, NY, USA, 1992.
- Epstein, C.; Carlsson, G.; Edelsbrunner, H. Topological data analysis. Inverse Probl. **2011**, 27, 120201.
- Baudot, P.; Tapia, M.; Goaillard, J. Topological Information Data Analysis: Poincaré-Shannon Machine and Statistical Physic of Finite Heterogeneous Systems. Preprints **2018**, 2018040157.
- Ly, A.; Marsman, M.; Verhagen, J.; Grasman, R.; Wagenmakers, E.J. A Tutorial on Fisher Information. J. Math. Psychol. **2017**, 80, 44–55.
- Mori, R. New Understanding of the Bethe Approximation and the Replica Method. Ph.D. Thesis, Kyoto University, Kyoto, Japan, 2013.

**Figure 1.** Example of the four maxima (left panel) and of the two minima of ${I}_{3}$ for three binary variables. (**a**) informal representation of the 7-simplex of probabilities associated with three binary variables; the values of the atomic probabilities that achieve the extremal configurations are noted at each vertex. (**b**) representation of the associated probabilities in the data space of the 3 variables for these extremal configurations. (**c**) information ${I}_{k}$ landscapes of these configurations (top), and representation of these extremal configurations on the probability cube, where the colors represent the non-null atomic probabilities of each extremal configuration (bottom).

**Figure 2.** Examples of 4-modules (quadruplets) of gene expression with the highest (positive) and lowest (negative) ${I}_{4}$, represented in the data space. (**a**) two 4-modules of genes sharing among the highest positive ${I}_{4}$ of the gene expression data set (cf. Section 6.1); the data are represented in the data space of the measured expression of the 4 variable genes, with the fourth dimension-variable color coded. (**b**) two 4-modules of genes sharing among the lowest negative ${I}_{4}$. All the modules were found to be significant according to the dependence test introduced in Section 6.5, except the module $\{17,19,21,13\}$. The identified extremal modules (although different) give similar patterns of dependences [4,45].

**Figure 3.** Example of an ${I}_{k}$ landscape and path analysis. (**a**) heatmap (transpose of matrix D) of $n=20$ neurons with $m=41$ genes; (**b**) the corresponding ${H}_{k}$ landscape; (**c**) the corresponding ${I}_{k}$ landscape; (**d**) maximum (red) and minimum (blue) ${I}_{k}$ information paths; (**e**) histograms of the distributions of ${I}_{k}$ for $k=1,\dots,12$. See text for details.

**Figure 4.** ${I}_{k}$, ${H}_{k}$ and ${G}_{k}$ (Total Free Energy, TFE) landscapes. (**a**) entropy ${H}_{k}$ and (**b**) mutual-information ${I}_{k}$ (free energy components) landscapes (same representation as Figure 3, ${k}_{u}=11$, p-value 0.05); (**c**) ${G}_{k}$ landscape (total correlation, also called multi-information, integrated information or total free energy); (**d**) the landscape of ${G}_{k}$ per body (${G}_{k}/k$).

**Figure 5.** **${H}_{k}-{I}_{k}$ landscape: Gibbs–Maxwell’s entropy vs. energy representation.** ${H}_{k}$ and ${I}_{k}$ are plotted in abscissa and ordinate, respectively, for dimensions $k=1,\dots ,12$, for the same data and settings as in Figure 3 ($n=20$ cells, $m=47$ genes, $N=9$, ${k}_{u}=11$). Compare the difficulty of identifying the two cell types from the pairwise $k=2$ landscape with the $k=10$ landscape.

**Figure 6.** Principles of probability estimation for two random variables. (**a**) illustration of the basic procedure used in practice to estimate the probability density for the two genes ($n=2$) Gene5 and Gene21 in 111 population A neurons ($m=111$) using a graining of 9 (${N}_{1}={N}_{2}=9$). The data points corresponding to the 111 observations are represented as red dots, and the graining is depicted by the 81-box grid (${N}_{1}.{N}_{2}$). The borders of the graining interval are obtained by taking the maximum and minimum measured values of each variable, and the interval is then divided regularly into ${N}_{i}$ bins. Projections of the data points onto lower-dimensional variable subspaces (the ${X}_{1}$ and ${X}_{2}$ axes here) are obtained by marginalization, giving the marginal probability laws of the two variables ${X}_{1}$ and ${X}_{2}$ (${P}_{{X}_{i},{N}_{i},m}$), represented as histograms above the ${X}_{1}$-axis and to the right of the ${X}_{2}$-axis. (**b**) heatmaps representing the expression levels of the 21 genes of interest on a ${log}_{2}Ex$ scale (top, raw heatmap) and after resampling with a graining of 9 (bottom, ${N}_{1}={N}_{2}=\dots={N}_{21}=9$).
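The binning procedure of panel (a) can be sketched as follows; this is a simplified illustration of regular min–max binning, not the authors' implementation (the function name is hypothetical):

```python
import numpy as np

def grained_law(data, N):
    """Empirical joint law on a regular grid of N bins per variable,
    each variable binned between its min and max (Figure 6 procedure).
    data: (m, n) array of m observations of n variables."""
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant variables
    # Bin index in 0..N-1; points at the max value fall in the last bin.
    idx = np.minimum((((data - lo) / span) * N).astype(int), N - 1)
    counts = np.zeros((N,) * n)
    for row in idx:
        counts[tuple(row)] += 1
    return counts / m
```

Marginal laws (the histograms along each axis in panel (a)) then follow by summation, e.g. `grained_law(d, 9).sum(axis=1)` for ${X}_{1}$ and `.sum(axis=0)` for ${X}_{2}$.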

**Figure 7.** Determination of the undersampling dimension ${k}_{u}$. (**a**) distributions of ${H}_{k}$ for $m=111$ population A neurons (green) and $m=37$ population B neurons (dark red) for $k=1,\dots,6$; the horizontal red line represents the threshold, fixed at 5% of the total number of k-tuples. (**b**) plot of the percentage of maximum-entropy ${H}_{k}=lnm$ biased values as a function of the dimension k; the horizontal red line represents the 5% threshold, giving ${k}_{u}=6$ for population A and ${k}_{u}=4$ for population B neurons. (**c**) the mean $\langle HP\rangle (k)$ paths for these two populations of neurons; the maximum entropy ${H}_{k}=lnm$ is represented by plain horizontal lines.
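The ${k}_{u}$ criterion of panel (b) amounts to finding the largest dimension at which fewer than 5% of the k-tuple entropy estimates saturate at the maximum attainable with m distinct samples (each sample alone in its box). A sketch under assumed names and input layout, not the paper's code:

```python
import numpy as np

def undersampling_dimension(H_by_k, m, threshold=0.05):
    """H_by_k: dict mapping k -> array of estimated H_k over all k-tuples
    (in bits). Returns the largest k whose fraction of saturated estimates
    (H_k equal to log2 m, the value forced by undersampling) stays below
    the threshold."""
    h_max = np.log2(m)
    k_u = 0
    for k in sorted(H_by_k):
        saturated = np.mean(np.isclose(H_by_k[k], h_max))
        if saturated < threshold:
            k_u = k
        else:
            break
    return k_u
```

For example, with $m=8$ (so the saturation value is 3 bits), a dimension whose tuples all stay below 3 bits is accepted, while the first dimension with too many saturated tuples stops the scan.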

**Figure 8.** Probability and information landscape of shuffled data (analysis with genes as variables). (**a**) joint and marginal distributions of two genes (genes 4 and 12) for $m=111$ population A neurons; (**b**) joint and marginal distributions after shuffling the expression values of each gene; (**c**) the estimated ${I}_{k}$ landscape for the expression of the 21 genes after shuffling; (**d**) histograms of the distribution of ${I}_{k}$ values for all degrees up to $k=5$ for population B. The total number of combinations $C(n,k)$ for each degree (number of pairs for ${I}_{2}$, number of triplets for ${I}_{3}$, etc.) is given in gray. The averaged shuffled information values obtained with 17 shuffles are represented on each histogram as a black line, and the statistical significance threshold values for $p=0.1$ are represented as vertical dotted lines.
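The shuffling procedure of panel (d) can be sketched for the pairwise case: independently permuting each variable destroys all dependences while preserving the marginals, and the resulting null distribution of ${I}_{2}$ yields the significance threshold. An illustrative sketch (function names assumed, not the authors' code):

```python
import numpy as np

def mutual_info(x, y):
    """Pairwise mutual information (bits) of two discrete sequences."""
    xs = np.unique(x, return_inverse=True)[1]
    ys = np.unique(y, return_inverse=True)[1]
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    for a, b in zip(xs, ys):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz])))

def shuffle_threshold(x, y, n_shuffles=17, p=0.1, seed=0):
    """Null I_2 by independently permuting each variable (dependences
    destroyed, marginals preserved); threshold is the (1 - p) quantile."""
    rng = np.random.default_rng(seed)
    null = [mutual_info(rng.permutation(x), rng.permutation(y))
            for _ in range(n_shuffles)]
    return float(np.quantile(null, 1 - p))
```

An observed ${I}_{2}$ above `shuffle_threshold(x, y)` is then deemed significant at level $p$, mirroring the vertical dotted lines in the histograms.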

**Figure 9.**Effect of changing sample size and graining on the identification of gene modules. The figure corresponds to the case of analysis with genes as variables for the population A neurons. The positive ${I}_{k}$ paths of maximum length were computed for a variable number of cells (m, left column) and a variable graining (N, right column). For clarity, only the two positive paths of maximum length are represented (first in red, second in black) for each parameter setting and the direction of each path is indicated by arrowheads. The two positive paths of maximum length for the original setting ($N=9$, $m=111$) are represented on the scaffold at the top of the figure for comparison. Smaller samples of cells (one random pick of 34, 56 and 89 cells) and larger ($N=11$) or smaller ($N=5,N=7$) graining than the original ($N=9$) were tested. Although slight differences in paths can be seen (especially for $N=11$), most of the parameter combinations identify gene modules that strongly overlap with the module identified using the original setting.

**Figure 10.** Iso-sample-size (m) and iso-graining mean $\langle IP\rangle (k)$ landscapes (analysis with genes as variables for the population A neurons). (**a**) the mean $\langle IP\rangle (k)$ paths for $N=2,\dots ,18$ and $n=21$ genes for the $m=111$ population A neurons. The “undersampling” region beyond ${k}_{u}$ is shaded in white and delimited by a black dotted line (${k}_{u}$ was undetermined for $N=2,3$). For $N=2$, the mean $\langle IP\rangle (k)$ path has no non-trivial minimum (it is monotonically decreasing); this $N=2$ iso-graining is analogous to the non-condensed disordered phase of non-interacting bodies, with $\langle IP\rangle (k)\approx 0$ for all $k>1$. All the other mean $\langle IP\rangle (k)$ paths have non-trivial critical dimensions. The condition $N=9$, $m=111$ used for the analysis (surrounded by dotted red lines) was chosen to lie in the condensed phase above the critical graining (here ${N}_{c}=3$), close to the criterion of the maximal mutual-information coefficient $M{I}_{2}C$ proposed by Reshef and colleagues (bin surrounded by a green dotted line), and with a not-too-low undersampling dimension. (**b**) the mean $\langle IP\rangle (k)$ paths for $m=111,100,\dots ,12$ population A neurons and $n=21$ genes with a number of bins $N=9$.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Baudot, P.; Tapia, M.; Bennequin, D.; Goaillard, J.-M.
Topological Information Data Analysis. *Entropy* **2019**, *21*, 869.
https://doi.org/10.3390/e21090869
