# Neural Estimator of Information for Time-Series Data with Dependency

## Abstract


## 1. Introduction

## 2. Preliminaries

#### 2.1. Notation

#### 2.2. Information Measures

**Assumption 1.**

$\mathbf{X}$ and $\mathbf{Y}$, respectively. Both quantities are functions of the CMI, and Figure 1 visualizes the variables appearing in each CMI term of DI and TE. In particular, each CMI term in (3) quantifies the amount of information shared between ${X}^{i}$ and ${Y}_{i}$ conditioned on ${Y}^{i-1}$, i.e., it excludes the effect of the causal history of $\mathbf{Y}$. In its general form, the causal effect of the process $\mathbf{X}$ on $\mathbf{Y}$ while causally conditioning on $\mathbf{Z}$ can be expressed analogously. Normalizing DI by $n$ yields the directed information rate (DIR), defined below:
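
The displays that follow in the original are lost in this copy; in standard form (a reconstruction consistent with the description above, not the paper's exact typesetting), DI and its causally conditioned variant read:

```latex
I(X^{n} \to Y^{n}) = \sum_{i=1}^{n} I(X^{i}; Y_{i} \mid Y^{i-1}),
\qquad
I(X^{n} \to Y^{n} \,\|\, Z^{n}) = \sum_{i=1}^{n} I(X^{i}; Y_{i} \mid Y^{i-1}, Z^{i}),
```

and the directed information rate (DIR) is the per-sample limit:

```latex
I(\mathbf{X} \to \mathbf{Y}) \triangleq \lim_{n \to \infty} \frac{1}{n}\, I(X^{n} \to Y^{n}).
```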

**Example 1.**

#### 2.3. Estimating the Variational Bound

**Definition 1.**
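
Definition 1's statement is missing from this copy; the Donsker–Varadhan (DV) representation that underlies the variational bound can be stated in its standard form (our reconstruction, with the supremum over measurable $f$ for which both expectations are finite):

```latex
I(X; Y \mid Z)
= D_{\mathrm{KL}}\big( p(x,y,z) \,\big\|\, p(x \mid z)\, p(y,z) \big)
= \sup_{f}\; \mathbb{E}_{p(x,y,z)}\big[ f(X,Y,Z) \big]
  - \log \mathbb{E}_{p(x \mid z)\, p(y,z)}\big[ e^{f(X,Y,Z)} \big].
```

The supremum is attained by the log density ratio, which motivates approximating that ratio with a neural network.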

1. Construct the joint batch, containing samples generated according to $p(x,y,z)$.
2. Construct the product batch, containing samples generated according to $p(x|z)p(y,z)$.
3. Train the neural network with a particular loss function, explained later, to approximate ${f}^{*}(x,y,z)$, i.e., the density ratio $\frac{p(x,y,z)}{p(x|z)p(y,z)}$.
4. Compute (11) using the batches and the approximated function.
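
Step (4) can be sketched as follows. This is an illustrative numpy sketch, not the paper's implementation: `dv_estimate` evaluates the empirical DV bound given a critic, and the zero critic below merely stands in for a network trained in step (3).

```python
import numpy as np

def dv_estimate(critic, joint_batch, product_batch):
    """Empirical Donsker-Varadhan estimate of CMI:
    the mean of the critic over the joint batch minus the log of the
    mean of exp(critic) over the product batch."""
    return critic(joint_batch).mean() - np.log(np.exp(critic(product_batch)).mean())

# Sanity check with a placeholder critic: if X and Y are conditionally
# independent given Z, the optimal critic is identically zero (the density
# ratio is 1), and the estimate is exactly 0.
rng = np.random.default_rng(0)
batch = rng.normal(size=(1000, 3))             # columns: x, y, z
zero_critic = lambda b: np.zeros(len(b))       # stands in for the trained network
print(dv_estimate(zero_critic, batch, batch))  # -> 0.0
```

With batches built as in Definitions 2 and 3 and a critic trained as in Section 3.2, the same call yields the CMI estimate.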

## 3. Main Results

#### 3.1. Batch Construction

Algorithm 1: Construction of the joint batch

**Definition 2** (Joint batch)**.**

Algorithm 2: Construction of the product batch

**Definition 3** (Product batch)**.**

**Remark 1.**

#### 3.2. Training the Classifier

**Remark 2.**
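
One standard way to realize the classifier-based training of Section 3.2 (used, e.g., in classifier-based CMI estimators) is the likelihood-ratio trick: a classifier trained with cross-entropy to output the probability $p$ that a sample came from the joint batch induces the critic $\log\frac{p}{1-p}$, an estimate of the log density ratio. A minimal sketch of the conversion (our illustration; the paper's exact loss may differ):

```python
import numpy as np

def logit_ratio(p):
    """Convert a classifier's predicted probability p = P(joint | sample)
    into an estimate of the log density ratio log[p(x,y,z) / (p(x|z)p(y,z))]."""
    return np.log(p / (1.0 - p))

print(logit_ratio(0.5))        # -> 0.0 (batches indistinguishable: ratio is 1)
print(logit_ratio(0.9) > 0.0)  # -> True (sample looks drawn from the joint)
```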

#### 3.3. Estimation of the DV Bound

Algorithm 3: Estimation of CMI

#### 3.4. Consistency Analysis

#### 3.4.1. Convergence for the Joint Batch

**Proposition 1.**

**Proof.**

#### 3.4.2. Convergence for the Product Batch

**Definition 4.** A process $\mathit{U}$ is ϕ-mixing if there exists a sequence ${\{{\varphi}_{n}\}}_{n\in \mathbb{N}}$ of positive numbers with ${\varphi}_{n}\to 0$ as $n\to \infty $ such that, for any integer $i>0$, we have:
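
The mixing coefficients can be stated in their standard form (our reconstruction; the original display may differ in notation):

```latex
\varphi_{n} = \sup \Big\{ \big| \Pr(B \mid A) - \Pr(B) \big| :
A \in \sigma(U_{1}, \ldots, U_{i}),\; \Pr(A) > 0,\;
B \in \sigma(U_{i+n}, U_{i+n+1}, \ldots) \Big\}.
```

The process is geometrically ϕ-mixing when, in addition, ${\varphi}_{n} \le c\,{\rho}^{n}$ for some $c > 0$ and $0 < \rho < 1$.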

**Assumption 2.**

**Assumption 3.**

**Proposition 2.**

**Proof.**

**Remark 3.**

#### 3.4.3. Convergence of the Overall Estimation

**Assumption 4.**

**Assumption 5.**

**Theorem 1.**

**Proof.**

## 4. Simulation Results

#### 4.1. Estimating Conditional Mutual Information

#### 4.2. Estimating Directed Information

## 5. Conclusions and Future Directions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
PDF | Probability density function |
IID | Independent and identically distributed |
MI | Mutual information |
CMI | Conditional mutual information |
DI | Directed information |
DIR | Directed information rate |
TE | Transfer entropy |
DV | Donsker-Varadhan |
NWJ | Nguyen-Wainwright-Jordan |
k-NN | k nearest neighbors |
ML | Machine learning |
RNN | Recurrent neural network |

## Appendix A. Proof of Proposition 1

**Lemma A1.**

**Proof.**

## Appendix B. Proof of Proposition 2

**Lemma A2.** Assume the sequence ${\{({U}_{i},{V}_{i})\}}_{i=1}^{n}$ is stationary and geometrically ϕ-mixing (see Definition 4). If $\frac{k(n)}{n}\to 0$ and $\frac{k(n)}{{(\log n)}^{2}}\to \infty $, then

**Lemma A3.**

**Proof.**

**Lemma A4.**

**Proof.**

**Lemma A5.**

**Proof.**

## Appendix C. Proof of Lemma A4

## Appendix D. Proof of Theorem 1


**Figure 1.** The memory considered for the conditional mutual information terms in directed information (**left**) and transfer entropy (**right**) at time instance $i$. To compute directed information (**left**), the effect of ${X}^{i}$ (i.e., ${X}_{i}$ and all its past samples) on ${Y}_{i}$ is considered, while the history of ${Y}_{i}$ is excluded. For transfer entropy (**right**), the effect of ${X}_{i-J}^{i-1}$ (i.e., the $J$ samples preceding ${X}_{i}$) on ${Y}_{i}$ is accounted for, while the history of ${Y}_{i}$ is again excluded. Note that the memory lengths $J$ and $L$ for transfer entropy may differ.

**Figure 2.** Construction of the product batch from the data set, shown in the left table. Let ${w}_{i}=1$, and suppose the $z$ components of the rows marked with ‘*’ (indexed ${j}_{1}$ and ${j}_{2}$) are the $k=2$ nearest neighbors of ${z}_{i}$. We then pack the triples $({x}_{{j}_{1}},{y}_{i},{z}_{i})$ and $({x}_{{j}_{2}},{y}_{i},{z}_{i})$ into the product batch, as in the right table.
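
The construction in Figure 2 can be sketched as follows, assuming scalar $z$-values and Euclidean distance (function and variable names are ours, not the paper's):

```python
import numpy as np

def product_batch(x, y, z, k=2):
    """For each row i, find the k nearest neighbors of z_i among the other
    rows and emit the triples (x_j, y_i, z_i); these mimic samples drawn
    from p(x|z) p(y,z)."""
    batch = []
    for i in range(len(z)):
        d = np.abs(z - z[i])
        d[i] = np.inf                        # never pair a row with itself
        for j in np.argsort(d)[:k]:          # indices of the k nearest z-values
            batch.append((x[j], y[i], z[i]))
    return batch

x = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([0.0, 0.1, 0.2, 5.0])
y = z + 1.0
print(len(product_batch(x, y, z, k=2)))      # -> 8, i.e., n*k triples
```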

**Figure 3.** Estimated CMI for the AR-1 model in (30) using $n=2\times {10}^{4}$ samples with $d=1$. The shaded region shows the range of the estimated values over the Monte Carlo trials.

**Figure 4.** Estimated CMI for the AR-1 model in (30) using $n=2\times {10}^{4}$ samples with $d=10$. The shaded region shows the range of the estimated values over the Monte Carlo trials. Blue shades correspond to estimation with our method, yellow shades to estimation with the MI-diff approach, and the green shade marks the overlap of the two areas.
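
Since model (30) is not reproduced in this copy, the following uses a hypothetical scalar AR-1 stand-in, ${Y}_{i} = a{Y}_{i-1} + b{X}_{i} + {N}_{i}$ with standard normal ${X}_{i}$ and ${N}_{i}$, and computes the Gaussian closed-form CMI that such experiments use as ground truth (all names and parameters here are illustrative, not the paper's):

```python
import numpy as np

def gaussian_cmi(samples, ix, iy, iz):
    """Closed-form CMI for jointly Gaussian data via covariance determinants:
    I(X;Y|Z) = 0.5*log( det S_XZ * det S_YZ / (det S_Z * det S_XYZ) )."""
    S = np.cov(samples, rowvar=False)
    det = lambda idx: np.linalg.det(S[np.ix_(idx, idx)])
    return 0.5 * np.log(det(ix + iz) * det(iy + iz) / (det(iz) * det(ix + iy + iz)))

rng = np.random.default_rng(1)
n, a, b = 100_000, 0.5, 1.0
x, eps = rng.normal(size=n), rng.normal(size=n)
y = np.zeros(n)
for i in range(1, n):              # hypothetical AR-1: Y_i = a*Y_{i-1} + b*X_i + N_i
    y[i] = a * y[i - 1] + b * x[i] + eps[i]

# I(X_i; Y_i | Y_{i-1}); for this model the true value is 0.5*log(1 + b^2) ~ 0.347
data = np.column_stack([x[1:], y[1:], y[:-1]])
print(gaussian_cmi(data, ix=[0], iy=[1], iz=[2]))
```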

**Figure 6.** Graphical representation of the causal influences between the processes using pairwise directed information (**a**) and causally conditioned directed information (**b**).

Parameter | Value |
---|---|
Hidden units | 64 |
Hidden layers | 2 (64 × 64) |
Activation | ReLU |
$\tau $ | ${10}^{-3}$ |
Optimizer | Adam |
Learning rate | ${10}^{-3}$ |
Epochs | 200 |

| | True DIR | Estimation with Our Method (Mean ± Std) |
|---|---|---|
| $I(\mathbf{X}\to \mathbf{Y})$ | $0.59$ | $0.57\pm 0.00$ |
| $I(\mathbf{X}\to \mathbf{Z})$ | $0.57$ | $0.55\pm 0.00$ |
| $I(\mathbf{Y}\to \mathbf{Z})$ | $1.99$ | $1.92\pm 0.01$ |
| $I(\mathbf{Y}\to \mathbf{X})$ | 0 | $0.00\pm 0.00$ |
| $I(\mathbf{Z}\to \mathbf{X})$ | 0 | $0.00\pm 0.00$ |
| $I(\mathbf{Z}\to \mathbf{Y})$ | 0 | $0.00\pm 0.00$ |

| | True DIR | Estimation with Our Method (Mean ± Std) |
|---|---|---|
| $I(\mathbf{X}\to \mathbf{Y} \parallel \mathbf{Z})$ | $0.59$ | $0.57\pm 0.00$ |
| $I(\mathbf{X}\to \mathbf{Z} \parallel \mathbf{Y})$ | 0 | $0.00\pm 0.00$ |
| $I(\mathbf{Y}\to \mathbf{Z} \parallel \mathbf{X})$ | $1.42$ | $1.52\pm 0.01$ |
| $I(\mathbf{Y}\to \mathbf{X} \parallel \mathbf{Z})$ | 0 | $0.01\pm 0.00$ |
| $I(\mathbf{Z}\to \mathbf{X} \parallel \mathbf{Y})$ | 0 | $0.01\pm 0.00$ |
| $I(\mathbf{Z}\to \mathbf{Y} \parallel \mathbf{X})$ | 0 | $0.01\pm 0.00$ |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Molavipour, S.; Ghourchian, H.; Bassi, G.; Skoglund, M. Neural Estimator of Information for Time-Series Data with Dependency. *Entropy* **2021**, *23*, 641.
https://doi.org/10.3390/e23060641
