Neural Estimator of Information for Time-Series Data with Dependency

Novel approaches to estimating information measures using neural networks have attracted considerable attention in recent years in both the information theory and machine learning communities. These neural estimators have been shown to converge to the true values of mutual information and conditional mutual information when trained on independent samples. However, if the samples in the dataset are not independent, the consistency of these estimators requires further investigation. This is of particular interest for more complex measures such as the directed information, which is pivotal in characterizing causality and is meaningful over time-dependent variables. The extension of the convergence proofs to such cases is not trivial and demands further assumptions on the data. In this paper, we show that our neural estimator of conditional mutual information is consistent when the dataset is generated by a stationary and ergodic source. In other words, we show that our neural-network-based information estimator converges asymptotically to the true value with probability one. Besides the universal functional approximation property of neural networks, a core lemma in the convergence proof is Birkhoff's ergodic theorem. Additionally, we use the technique to estimate directed information and demonstrate the effectiveness of our approach in simulations.


Introduction
In recent decades, a tremendous effort has been made to explore the capabilities of feedforward networks and their applications in various areas. Novel machine learning (ML) techniques go beyond conventional classification and regression tasks and enable revisiting well-known problems in fundamental areas such as information theory. The functional approximation power of neural networks is a compelling tool for estimating information-theoretic quantities such as entropy, KL-divergence, mutual information (MI), and conditional mutual information (CMI). As an example, MI is estimated with neural networks in [1], where numerical results show notable improvements over conventional methods for high-dimensional, correlated data.
Information-theoretic quantities are characterized by probability densities, and most classical approaches aim at estimating these densities. The techniques may vary depending on whether the random variables are discrete or continuous; in this paper, we focus on continuous random variables. Examples of conventional non-parametric methods for estimating these quantities are histogram and partitioning techniques, where the densities are approximated and plugged into the definitions of the quantities, or methods based on the distance of the k-th nearest neighbor [2]. Despite the vast applications of nearest-neighbor methods for the estimation of information-theoretic quantities, such as the technique proposed in [3], recent studies advocate using neural networks, and simulations demonstrate that the accuracy of the estimations improves in several scenarios [1,4]; in particular, the results indicate that the neural estimators degrade more gracefully as the dimension of the data increases. A natural next step is to extend these estimators from i.i.d. samples to ergodic processes; however, this generalization is not always trivial. The estimator proposed in [11] exhibits major improvements in estimating the CMI. Nevertheless, it is based on a k-nearest neighbors (k-NN) sampling technique, which makes the extension of the convergence proofs to non-i.i.d. data more involved. The main contribution of this paper is to provide convergence results and consistency proofs for this neural estimator when the data are stationary and ergodic Markov.
The paper is organized as follows. Notation and basic definitions are introduced in Section 2. In Section 3, the neural estimator and its procedures are explained, and the convergence of the estimator is studied when the data are generated from a Markov source. Next, we provide simulation results in Section 4 for synthetic scenarios and verify the effectiveness of our technique in estimating CMI and DI. Finally, we conclude the paper in Section 5 and suggest potential future directions.

Preliminaries
We begin by describing the notation used throughout the paper; the main definitions are explained afterwards. Then we review the variational bounds that are the basis of our neural estimator.

Notation
Random variables and their realizations are denoted by capital and lower-case letters, respectively. Given two integers i and j, a sequence of random variables X_i, X_{i+1}, . . . , X_j is denoted X_i^j, or simply X^j when i = 1. For a stochastic process Z, a randomly generated sample is denoted by the random variable Z. We indicate sets with calligraphic notation (e.g., 𝒳). The space of d-dimensional real vectors is denoted R^d. The probability density function (PDF) of a random variable X at X = x is denoted by p_X(x), or equivalently p(x), and the distribution of X by P_X, or simply P. The PDF of multiple random variables X_1, . . . , X_i is p_{X_1 . . . X_i}(x_1, . . . , x_i), represented for simplicity by p(x_1, . . . , x_i) in the paper. For the distribution P, E_P[·] denotes the expectation with respect to its density p(·). All logarithms are in base e.
The almost-sure convergence (convergence with probability one) of the sequence X_n to X is denoted by X_n →^{a.s.} X and is defined as: P(lim_{n→∞} X_n = X) = 1.

Information Measures
The information-theoretic quantities of interest in this work can be written in terms of a KL-divergence, and the available neural estimators originally aim to estimate this quantity. For a random variable X with support 𝒳 ⊆ R^d, the KL-divergence between two PDFs p(x) and q(x) is defined as:

D(p ‖ q) := ∫_𝒳 p(x) log (p(x)/q(x)) dx.    (1)

Then, the CMI can be defined using the KL-divergence as below:

I(X; Y|Z) := D(p(x, y, z) ‖ p(x|z)p(y, z)),    (2)
where Y and Z are random variables with supports 𝒴 and 𝒵, which are subsets of R^d. In this paper, we focus on extending the estimators for CMI to non-i.i.d. data, where samples in time-series data might not be independently and identically distributed (e.g., generated from a Markov process); nonetheless, our method and consistency proofs are fairly general and can be applied to estimating the KL-divergence as well. Consider a sequence of random samples {(X_i, Y_i, Z_i)}_{i=1}^n generated from the joint process (X, Y, Z), where the samples are not necessarily i.i.d. A simple step toward this extension is to verify that the previous neural estimators, e.g., [11], can be used to estimate I(X; Y|Z), where (X, Y, Z) ∼ p(x, y, z) and the processes (X, Y, Z) are Markov, as in the following assumption.
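As a quick numerical sanity check of (2), the following sketch (our own toy discrete distribution in NumPy; all names are illustrative) verifies that the KL-divergence between p(x, y, z) and p(x|z)p(y, z) coincides with the textbook definition of the CMI:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2)); p /= p.sum()   # toy joint pmf p(x, y, z)

pz  = p.sum(axis=(0, 1))                  # p(z)
pxz = p.sum(axis=1)                       # p(x, z)
pyz = p.sum(axis=0)                       # p(y, z)

# the "product" density p(x|z) p(y, z) = p(x, z) p(y, z) / p(z)
q = pxz[:, None, :] * pyz[None, :, :] / pz[None, None, :]

# CMI as the KL-divergence D(p || q), cf. (2)
cmi_kl = np.sum(p * np.log(p / q))

# CMI from its textbook definition E[log p(x,y|z) / (p(x|z) p(y|z))]
cmi_def = np.sum(p * np.log(p * pz[None, None, :] / (pxz[:, None, :] * pyz[None, :, :])))

assert np.isclose(cmi_kl, cmi_def) and cmi_kl >= 0
```

The two expressions are algebraically identical after multiplying numerator and denominator by p(z), which is exactly why the neural KL-divergence estimators apply to the CMI unchanged.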
Assumption 1. (X, Y, Z) are jointly stationary and ergodic first-order Markov with marginal density p(x, y, z). The extension of the results to d-th order Markov is straightforward.
To generalize the neural estimators further, it is possible to investigate their capability for information measures that rely on dependent random variables. Consider the pairs {(X_i, Y_i)}_{i=1}^n to be samples of the processes (X, Y). If the generated samples are dependent in time, it is possible to measure the causal relationship between the processes with quantities such as the DI and the TE, defined as below:

I(X^n → Y^n) := Σ_{i=1}^n I(X^i; Y_i | Y^{i−1}),    (3)
T^{(J,L)}_{X→Y} := Σ_i I(X^{i−1}_{i−J}; Y_i | Y^{i−1}_{i−L}),    (4)

where J and L are parameters of the TE that determine the length of the memory to consider for X and Y, respectively. Both quantities are functions of the CMI, and Figure 1 visualizes the corresponding variables in each CMI term for DI and TE. In particular, each CMI term in (3) quantifies the amount of shared information between X^i and Y_i conditioned on Y^{i−1}, i.e., it excludes the effect of the causal history of Y. In a general form, to express the causal effect of the process X on Y while conditioning causally on Z, DI is normalized with respect to n, which defines the directed information rate (DIR):

I(X → Y ‖ Z) := lim_{n→∞} (1/n) Σ_{i=1}^n I(X^i; Y_i | Y^{i−1}, Z^i).    (5)

By assuming the processes to be Markov, (5) can be simplified (see [23,35,36]). To be explicit, if both (X, Y, Z) and (Y, Z) are stationary and ergodic first-order Markov, from (5) the DIR can be simplified as:

I(X → Y ‖ Z) = I(X^2; Y_2 | Y_1, Z^2),    (6)

where the CMI is with respect to the stationary density p(x^2, y^2, z^2) of the Markov model. To generalize this approach, let us define the maximum Markov order (o_max) of a set of processes to be the minimum number o such that the Markov order of the joint random variables of any subset of the processes is less than or equal to o. So if o_max = l for (X, Y, Z), then from (5) we can simplify the DIR term as:

I(X → Y ‖ Z) = I(X^{l+1}; Y_{l+1} | Y^l, Z^{l+1}).    (7)

The following example shows how the DIR can be computed for a linear data model, and emphasizes the difference when the DIR is conditioned causally on another process.
Version May 15, 2021 submitted to Entropy

Figure 1. The memory considered for the conditional mutual information terms in directed information (left) and transfer entropy (right) at time instance i. To compute directed information (left), the effect of X^i (i.e., X_i and all its past samples) on Y_i is considered, while the history of Y_i is excluded. However, for transfer entropy (right), the effect of X^{i−1}_{i−J} (i.e., the previous J samples before X_i) on Y_i is accounted for, while we exclude the history of Y_i. Note that the lengths of the memories (J and L) for transfer entropy may differ.

Example 1. Consider processes X, Y, and Z where N^x, N^y, and N^z are uncorrelated white Gaussian noises with variances σ_x^2, σ_y^2, and σ_z^2, respectively, the model parameter satisfies |a| < 1, and (X_0, Y_0, Z_0) are distributed according to the stationary distribution of the processes X, Y, and Z. This model satisfies Assumption 1 with o_max = 1, so I(X → Y) can be computed in closed form, while the causally conditioned rate follows from (7). As emphasized earlier, (7) holds when (X, Y, Z) and (Y, Z) are Markov with order l. Then the CMI estimators can potentially be used to estimate the DIR. However, the consistency of the estimation still needs to be investigated, since the samples are not independent. Before introducing our technique, we review the basics of estimating information measures with neural networks.

Estimating the Variational Bound
The estimators proposed in [1,4,11] are all based on tight lower bounds on the KL-divergence, such as the DV bound, introduced in [7]:

D(p ‖ q) ≥ sup_{f∈F} E_P[f(X)] − log E_Q[e^{f(X)}],    (9)

where p and q are two PDFs defined over 𝒳 with corresponding distributions P and Q, respectively, and F is any class of functions f: 𝒳 → R such that the two expectations exist and are finite. Consider a neural network with parameters θ ∈ Θ; then F can be restricted to the class of all functions constructed with this neural network by choosing different values for the parameters θ. In more detail, let f_θ(x) be the end-to-end function of a neural network with parameters θ ∈ Θ; then the optimization on the right-hand side (RHS) of (9) is equivalent to optimizing over Θ (as performed in [1]). Nevertheless, we can leverage the fact that the DV bound is tight when the function is chosen as:

f*(x) = log (p(x)/q(x)).    (10)

Thus, the neural network can approximate f*(x) directly, and the lower bound can be computed accordingly (as performed in [4,11]).
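The tightness of the DV bound at f* can be illustrated numerically. The sketch below (a toy example of ours, not the paper's setup) estimates the KL-divergence between two unit-variance Gaussians by plugging the known optimal function f*(x) = log p(x)/q(x) into the bound and comparing against the closed-form value:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_p, mu_q, sigma, n = 1.0, 0.0, 1.0, 200_000
x_p = rng.normal(mu_p, sigma, n)      # samples from P
x_q = rng.normal(mu_q, sigma, n)      # samples from Q

def f_star(x):                        # optimal DV function f*(x) = log p(x)/q(x)
    return ((x - mu_q)**2 - (x - mu_p)**2) / (2 * sigma**2)

# DV lower bound evaluated at f*: E_P[f] - log E_Q[e^f]
dv = f_star(x_p).mean() - np.log(np.mean(np.exp(f_star(x_q))))

kl_true = (mu_p - mu_q)**2 / (2 * sigma**2)   # closed-form KL for equal-variance Gaussians
assert abs(dv - kl_true) < 0.05
```

With any other choice of f, the empirical bound would fall below kl_true; this is exactly the slack a trained network must eliminate.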

Definition 1.
For the PDFs p(x, y, z) and p(x|z)p(y, z), define the corresponding distributions on 𝒳 × 𝒴 × 𝒵 to be P and Q, respectively.
Since the CMI can be stated as a KL-divergence (2), the DV bound can be written for the CMI as below:

I(X; Y|Z) ≥ sup_{f∈F} E_P[f(X, Y, Z)] − log E_Q[e^{f(X, Y, Z)}],    (11)

and the bound is tight by choosing

f*(x, y, z) = log (p(x, y, z)/(p(x|z)p(y, z))).    (12)

The main barrier to computing this bound with f*(x, y, z) is that the densities are unknown. This challenge is addressed in [4,11] by proposing neural classifiers that can approximate f*(x, y, z) without knowing the densities. Below we review the steps of the estimation technique provided in [11]: (1) Construct the joint batch, containing samples generated according to p(x, y, z).
(2) Construct the product batch, containing samples generated according to p(x|z)p(y, z).
(3) Train the neural network with a particular loss function, which we explain later, to approximate f*(x, y, z), i.e., the density ratio p(x, y, z)/(p(x|z)p(y, z)). (4) Compute (11) using the batches and the approximated function.
To show the consistency of the estimation with this approach, it is crucial to verify that the empirical average with respect to each sample batch converges asymptotically to the corresponding expectation. Additionally, the neural network should be designed and trained to be capable of approximating the density ratio. For i.i.d. data samples, the authors in [4,11] provided the proofs in the form of concentration bounds. In this paper, we extend these proofs to non-i.i.d. data by providing convergence results for the special case of stationary and ergodic Markov processes. In the remainder of the paper, we denote the data by {(X_i, Y_i, Z_i)}_{i=1}^n, which are consecutive samples of the stationary Markov processes (X, Y, Z) with marginal PDF p(x, y, z).

Main Results
In this section, we describe our proposed neural estimator in detail. To create the batches, the estimator is equipped with a k-NN sampling block such that the empirical average over the samples converges to the expected mean. Next, we describe the roadmap to show the convergence of the estimation to the true value (i.e., consistency analysis).

Batch Construction
To create the joint batch it is sufficient to take (X i , Y i , Z i ) randomly from the available data. Below we define the joint batch formally using an auxiliary random variable that indicates whether an instance is selected or not (see also Algorithm 1 for the implementation).

Algorithm 1: Construction of the joint batch
Definition 2 (Joint batch). Let W_i ∼ Bernoulli(α) for i = 1, . . . , n be independent random variables, and let I_{α,n}(W^n) := {i : W_i = 1}, where we write I_{α,n} to simplify the notation. The joint batch is then B^α_joint := {(X_i, Y_i, Z_i) : i ∈ I_{α,n}}.
Please note that, by the law of large numbers, the length of the joint batch is asymptotically αn. Next, to construct the product batch, we use the method based on the k-NN technique introduced in [11]. Below we define our method, denoted the isolated k-NN technique, and explain how the product batch is constructed (see also Algorithm 2).
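A minimal sketch of the joint-batch construction (hypothetical variable names; the actual implementation is in the linked repository):

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 10_000, 0.3
data = rng.normal(size=(n, 3))        # rows play the role of (x_i, y_i, z_i)

w = rng.random(n) < alpha             # W_i ~ Bernoulli(alpha)
joint_batch = data[w]                 # keep exactly the rows with W_i = 1

# by the law of large numbers, the batch length concentrates around alpha * n
assert abs(len(joint_batch) / n - alpha) < 0.02
```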

Algorithm 2: Construction of the product batch
Definition 3 (Product batch). For s < n, let W_i ∼ Bernoulli(α′) for i = 1, . . . , s be independent random variables, and let I_{α′,s}(W^s) := {i ≤ s : W_i = 1}, with complement I^c_{α′,s}(W^s) := {1, . . . , n} \ I_{α′,s}(W^s). Then, for any ζ ∈ 𝒵 and given the data {(x_i, y_i, z_i)}_{i=1}^n, define A_{α′,k,n,s}(ζ, z^n, w^s) as the set of indices of the k nearest neighbors of ζ (by Euclidean distance) among {z_i} for i ∈ I^c_{α′,s}(w^s). So the product batch can be defined as:

B^{α′,s}_prod := {(x_j, y_i, z_i) : i ∈ I_{α′,s}, j ∈ A_{α′,k,n,s}(z_i, z^n, w^s)}.

Hereafter we use I_{α′,s}, I^c_{α′,s}, and A_{α′}(ζ) instead, as the remaining parameters can be understood from the context. We refer to this sampling technique as isolated k-NN in the sequel. An example is also provided in Figure 2 for the case of k = 2.
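The isolated k-NN sampler can be sketched as follows. This is an illustrative NumPy version of ours for one-dimensional z, not the reference implementation; `product_batch` and its arguments are hypothetical names:

```python
import numpy as np

def product_batch(x, y, z, s, alpha_p, k, rng):
    """Sketch of the isolated k-NN sampler (Definition 3 / Algorithm 2)."""
    n = len(z)
    w = rng.random(s) < alpha_p                 # W_i ~ Bernoulli(alpha') on the first s indices
    iso = np.flatnonzero(w)                     # isolated indices I_{alpha',s}
    pool = np.setdiff1d(np.arange(n), iso)      # neighbours searched in the complement
    batch = []
    for i in iso:
        nbrs = pool[np.argsort(np.abs(z[pool] - z[i]))[:k]]   # k-NN of z_i (1-D distance)
        for j in nbrs:
            batch.append((x[j], y[i], z[i]))    # pack (x_j, y_i, z_i)
    return np.array(batch)

rng = np.random.default_rng(3)
n, s, alpha_p, k = 2000, 1000, 0.2, 5
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)                # x correlated with z
y = rng.normal(size=n)
B = product_batch(x, y, z, s, alpha_p, k, rng)

# the product batch length is close to alpha' * s * k (cf. Remark 1)
assert abs(len(B) / (alpha_p * s * k) - 1) < 0.25
```

Replacing x_i with the x of a z-neighbour breaks the (x, y) dependence while approximately preserving the conditional law of x given z, which is exactly how samples from p(x|z)p(y, z) are imitated.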
Figure 2. Construction of the product batch from the dataset, shown as the left table. Let w_i = 1, and let the z components of the rows marked with '*' (indexed j_1 and j_2) be in the k nearest neighborhood of z_i for k = 2. Then we pack the triples (x_{j_1}, y_i, z_i) and (x_{j_2}, y_i, z_i) into the product batch, as in the right table.

Remark 1.
Here we emphasize that the isolated indices are selected from the first s indices of the samples, while the neighbors can be searched among all n indices of the data except the ones in I_{α′,s}(w^s). Additionally, note that the length of the product batch is α′sk asymptotically as n → ∞, because sk also tends to ∞, as we see later in the assumptions of Proposition 2.

Training the Classifier
As explained earlier, the optimal function for a tight lower bound on the CMI is obtained from the density ratio, and to compute it we use the functional approximation power of neural networks. Consider a feedforward neural network whose last layer is equipped with the sigmoid function. The network is parameterized with θ ∈ Θ ⊆ R^h, where h is the number of parameters, and the neural network function is denoted by ω_θ: 𝒳 × 𝒴 × 𝒵 → [0, 1]. For an input (X, Y, Z) of the network, let C ∈ {0, 1} denote the class of the input, which determines whether the tuple is generated according to p(x, y, z) or p(x|z)p(y, z). To be explicit, the input is either picked from the joint batch (class C = 1) or the product batch (class C = 0), and the goal is to learn the network parameters such that it can distinguish the class of new (unseen) queries. Let the loss function be the binary cross-entropy function. So, for ω being any function with inputs (x, y, z) and range [0, 1], the expected loss is defined as:

L(ω) := −E[C log ω(X, Y, Z) + (1 − C) log(1 − ω(X, Y, Z))].    (15)

It is well established that by minimizing L(ω), the solution ω* represents the probability of classifying the input into class C = 1 given the input data, i.e., P(C = 1 | x, y, z). In fact, as shown in [11] (Lemma 1), if the prior distribution on the classes is unbiased, by taking the derivative in (15) we have:

Γ(x, y, z) := p(x, y, z)/(p(x|z)p(y, z)) = ω*(x, y, z)/(1 − ω*(x, y, z)).    (16)

So from (12) the optimal function can be expressed with Γ(x, y, z) as:

f*(x, y, z) = log Γ(x, y, z).

Therefore, by training the neural network, we can approximate the optimal function f*(x, y, z) and estimate the lower bound on the CMI. Consider the neural network ω_θ; then the empirical loss function L_emp(ω_θ) is defined as the binary cross-entropy evaluated over the samples of the two batches, and the optimal parameters are obtained by solving the following problem:

θ̂ = argmin_{θ∈Θ} L_emp(ω_θ).    (19)

Consequently, we can approximate the density ratio Γ(x, y, z) from (16):

Γ̂(x, y, z) = ω_θ̂(x, y, z)/(1 − ω_θ̂(x, y, z)).    (20)

To avoid boundary values (i.e., ω_θ̂(x, y, z) close to zero or one), the output of the neural network is clipped to [τ, 1 − τ] for some small τ > 0.
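The relations ω* = p/(p + q) and Γ = ω*/(1 − ω*) in (16) can be checked numerically on a toy pair of discrete "densities" (an illustrative sketch of ours with made-up distributions):

```python
import numpy as np

rng = np.random.default_rng(5)
u = np.linspace(-4, 4, 9)
p = np.exp(-u**2 / 2); p /= p.sum()          # toy discrete "density" p
q = np.exp(-(u - 1)**2 / 2); q /= q.sum()    # toy discrete "density" q

def bce(omega):                              # expected binary cross-entropy, unbiased prior on C
    return -0.5 * np.sum(p * np.log(omega)) - 0.5 * np.sum(q * np.log(1 - omega))

omega_star = p / (p + q)                     # claimed minimizer, cf. (16)
for _ in range(100):                         # random perturbations never do better
    omega = np.clip(omega_star + 0.1 * rng.normal(size=u.size), 1e-3, 1 - 1e-3)
    assert bce(omega_star) <= bce(omega) + 1e-12

assert np.allclose(omega_star / (1 - omega_star), p / q)   # Gamma = omega*/(1 - omega*)
```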

Remark 2.
Please note that Γ̂(x, y, z) approximates the density ratio only if the batch sizes |B^α_joint| and |B^{α′,s}_prod| are balanced. Otherwise, (20) requires a correction coefficient (see [11]). To fulfill this, given the number of samples n, one can choose the parameters such that αn = α′sk. Then, by the law of large numbers, the batches will asymptotically be balanced.

Estimation of the DV Bound
The final step in the estimation of the CMI is to compute the lower bound (11) empirically using Γ̂(x, y, z). So, by substituting the expectations with empirical averages with respect to the samples in the joint and product batches, the CMI estimator is defined as:

Î_DV(X; Y|Z) := (1/|B^α_joint|) Σ_{(x,y,z)∈B^α_joint} log Γ̂(x, y, z) − log( (1/|B^{α′,s}_prod|) Σ_{(x,y,z)∈B^{α′,s}_prod} Γ̂(x, y, z) ).    (21)

In practice, to mitigate the inaccuracy induced by sampling from the original data, the training and estimation are repeated over several sampling trials. The steps for implementing the estimator are described in Algorithm 3. In the next part, we provide the convergence results for our estimator to validate the substitution of the expectations in (11) with empirical averages with respect to the joint and product batches. Then we show the convergence of the overall estimation to the true CMI value.
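Putting the pieces together, the following self-contained sketch mimics the classifier-based DV estimator for the simpler unconditional MI (no Z), replacing the neural network with a logistic model on hand-picked quadratic features so that the optimal log-ratio is exactly realizable. All names, features, and constants are illustrative choices of ours, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)                   # joint samples, class C = 1
xp = rng.normal(size=n)                      # product samples (x indep. of y), C = 0

def phi(a, b):                               # quadratic features: the true log-ratio is realizable
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

F = np.vstack([phi(x, y), phi(xp, y)])
C = np.concatenate([np.ones(n), np.zeros(n)])

def sigmoid(t):
    return 1 / (1 + np.exp(-np.clip(t, -30, 30)))

w = np.zeros(F.shape[1])                     # logistic "network", fit by Newton's method
for _ in range(25):
    prob = sigmoid(F @ w)
    grad = F.T @ (prob - C) / len(C)
    hess = (F * (prob * (1 - prob))[:, None]).T @ F / len(C) + 1e-8 * np.eye(F.shape[1])
    w -= np.linalg.solve(hess, grad)

tau = 1e-3                                   # clip the output to [tau, 1 - tau]
omega = np.clip(sigmoid(F @ w), tau, 1 - tau)
log_ratio = np.log(omega / (1 - omega))      # log of the estimated density ratio

# empirical DV bound: joint average of the log-ratio minus log of the product average
dv = log_ratio[:n].mean() - np.log(np.exp(log_ratio[n:]).mean())
assert abs(dv - 0.5 * np.log(2)) < 0.05      # true I(X;Y) = 0.5 log 2 for this model
```

The same two-term computation, applied to the joint and product batches with a trained network, is what (21) prescribes.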

Consistency Analysis
The consistency of our neural estimator (i.e., showing that the estimator converges to the true value) is based on the universal functional approximation power of neural networks and on concentration results for the samples collected in the joint batch and in the product batch using the isolated k-NN. Informally, Hornik's functional approximation theorem [37] guarantees that feedforward neural networks are capable of fitting any continuous function. So, depending on the true density of the data, there exists a choice of parameters that enables approximating the desired function with any arbitrary accuracy. Next, we show that the empirical loss function L_emp(ω_θ) is concentrated around its mean L(ω_θ) for any θ. Combining these tools, we are able to minimize the empirical loss function as in (19), and we expect the minimizer θ̂ to be asymptotically close to the optimal parameters; thus, eventually Γ̂(x, y, z) properly approximates Γ(x, y, z). Additionally, the empirical computation of the DV bound is concentrated around the expected value, which concludes the consistency of the end-to-end estimation of the CMI.
In this paper, we put the main focus on extending the concentration results provided in [11] (Proposition 1) under the Markov assumption on the data. Although many asymptotic results for i.i.d. data are conventionally assumed to hold for Markov data as well, the required extensions here are more involved due to the additional complexity of the k-NN method. In the following, we first show the convergence of the empirical average over the joint batch for any measurable function g(·) such that the expectation exists and is finite. As the product batch collects samples corresponding to the k nearest neighbors, convergence results for nearest-neighbor regression are invoked to show that the empirical average over the product batch converges to the expectation with respect to the product distribution Q. Then, we conclude the consistency of the overall estimation.

Convergence for the Joint Batch
One well-known extension of the law of large numbers to non-i.i.d. processes is Birkhoff's ergodic theorem, which is the basis of our proof of the following proposition on the convergence of the sample average over the joint batch.

Proposition 1. Consider the sequence of random variables {(X_i, Y_i, Z_i)}_{i=1}^n generated under Assumption 1, and consider the distribution P in Definition 1. For any measurable function g(·) such that E_P[g(X, Y, Z)] exists and is finite,

(1/|I_{α,n}|) Σ_{i∈I_{α,n}} g(X_i, Y_i, Z_i) →^{a.s.} E_P[g(X, Y, Z)].

Proof. See Appendix A.
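Proposition 1 can be visualized with a toy stationary ergodic AR(1) process (a hypothetical example of ours, with α = 1 so every sample enters the batch): the time average of g(X_i) = X_i² approaches the expectation under the stationary law, exactly as Birkhoff's theorem predicts, even though the samples are strongly dependent:

```python
import numpy as np

rng = np.random.default_rng(6)
a, sigma, n = 0.8, 1.0, 200_000
stat_var = sigma**2 / (1 - a**2)            # variance of the stationary law N(0, sigma^2/(1-a^2))

e = rng.normal(0, sigma, n)
x = np.empty(n)
x[0] = rng.normal(0, np.sqrt(stat_var))     # start from the stationary distribution
for i in range(1, n):
    x[i] = a * x[i - 1] + e[i]              # X_{i+1} = a X_i + N_i

g = x**2                                    # g(x) = x^2, with E_P[g] = stat_var
assert abs(g.mean() - stat_var) < 0.05 * stat_var
```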

Convergence for the Product Batch
From Definition 3, the empirical summation over all samples in the product batch is equivalent to averaging |I_{α′,s}| k-NN regressions. Considering a sequence of pairs {(U_i, V_i)}_{i=1}^n, the k-NN regression estimate at a point u is m_n(u) := (1/k) Σ_{j=1}^k V_{r_j}, where r_j refers to the j-th nearest neighbor of u among U_1, . . . , U_n. This problem has been well studied when the pairs (U_i, V_i) are generated i.i.d. For example, in [38] the authors show the almost-sure convergence of m_n(u) to the regression function E[V | U = u] (with a rate involving a positive constant a) when k(n) → ∞ and k(n)/n → 0. However, if the pairs are not independent, convergence results require a more advanced condition, denoted the geometric φ-mixing condition or geometric ergodicity condition [39,40]. As argued in [39], geometric ergodicity is not a restrictive statement and holds for a wide range of processes (see also [41]). For instance, linear autoregressive processes are geometrically ergodic [41] (Ch. 15.5.2). Below we review the φ-mixing condition.
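A small illustration of the k-NN regression m_n(u) with i.i.d. pairs, using the choice k(n) = n^{1/2} mentioned in Remark 3 (toy data of ours; E[V | U = u] = u² by construction):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
k = int(np.sqrt(n))                       # k(n) = n^(1/2), cf. Remark 3
U = rng.uniform(-1, 1, n)
V = U**2 + 0.1 * rng.normal(size=n)       # regression function E[V | U = u] = u^2

def m_n(u):                               # k-NN regression estimate at u
    nbrs = np.argsort(np.abs(U - u))[:k]  # indices of the k nearest neighbours of u
    return V[nbrs].mean()

for u0 in (-0.5, 0.0, 0.5):
    assert abs(m_n(u0) - u0**2) < 0.03    # close to the true conditional mean
```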
Definition 4 (φ-mixing condition). A process U is φ-mixing if, for a sequence {φ_n}_{n∈N} of positive numbers satisfying φ_n → 0 as n → ∞, and for any integer i > 0, we have:

|P(B | A) − P(B)| ≤ φ_i,

for all n > 0 and all sets A and B that are members of σ(U_1, . . . , U_n) and σ(U_{n+i}, U_{n+i+1}, . . .), respectively. If {φ_n} is a geometric sequence, U is called geometrically φ-mixing.
To show the convergence of the empirical average over the product batch, we make the following assumptions.
Assumption 2. The joint process (X, Y, Z) is geometrically φ-mixing.

Assumption 3. We assume that 𝒴 and 𝒵 are compact.

Proposition 2.
Let the sequence of random variables {(X_i, Y_i, Z_i)}_{i=1}^n be generated under Assumptions 1-3, and choose k(n) and s(n) such that:

s(n), k(n) → ∞ and k(n)/n → 0 as n → ∞.    (25)

Consider Q as defined in Definition 1. Then, for any function g(·) such that E_Q[g(X, Y, Z)] exists and is finite, and which additionally satisfies

|g(x, y_1, z) − g(x, y_2, z)| < L_g ‖y_1 − y_2‖,  ∀x ∈ 𝒳, z ∈ 𝒵, y_1, y_2 ∈ 𝒴,    (26)

where L_g > 0 is the Lipschitz constant, we have that:

(1/|B^{α′,s}_prod|) Σ_{(x,y,z)∈B^{α′,s}_prod} g(x, y, z) →^{a.s.} E_Q[g(X, Y, Z)].

Proof. See Appendix B.

Remark 3.
Examples of choices for k(n) satisfying (25) are, for instance, k(n) = n^{1/2} and k(n) = (log n)^{2+ε} for some ε > 0. Please note that in [11], the consistencies are shown when k(n) = Θ(n^{1/2}). However, the convergence result in [11] (Theorem 1) is an explicit bound, so the condition on k(n) can be relaxed (choosing a smaller k(n)) when we are only interested in the asymptotic behavior.

Convergence of the Overall Estimation
To complete our consistency analysis of the neural estimator, it is required to show that the loss function is properly approximated and that it converges to the optimal loss as n increases. The following assumptions on the neural network and the densities enable us to show this convergence.

Assumption 5.
There exist 0 < p_min < p_max < ∞ such that for all (x, y, z) ∈ 𝒳 × 𝒴 × 𝒵, the values of p(x, y, z) and p(x|z)p(y, z) are both in the interval [p_min, p_max], and it holds that τ ≤ p_min/(p_min + p_max), to guarantee that τ ≤ ω* ≤ 1 − τ.
Proof. See Appendix D.
In the next section, we apply our estimator in several synthetic scenarios to verify its capability in estimating CMI and DI.

Simulation Results
In this section, we experiment with our proposed estimator of CMI and DI on the following auto-regressive model, which is widely used in different applications, including wireless communications [42], defining causal notions in econometrics [43], and modeling traffic flow [44], among others:

[X_i, Y_i, Z_i]^T = A [X_i, Y_i, Z_i]^T + B [X_{i−1}, Y_{i−1}, Z_{i−1}]^T + [N^x_i, N^y_i, N^z_i]^T,    (30)

where A and B are 3 × 3 matrices and the rest of the variables are d-dimensional row vectors.
A models the instantaneous effect of X_i, Y_i, and Z_i on each other and has zero diagonal elements, while B models the effect of the previous time instance. N^x_i, N^y_i, and N^z_i (referred to as noise in some contexts) are independent and generated i.i.d. according to zero-mean Gaussian distributions with covariance matrices σ_x^2 I_d, σ_y^2 I_d, and σ_z^2 I_d, respectively (i.e., the dimensions are d and the components are uncorrelated). Please note that this model fulfills Assumptions 1 and 2 when the initial random variables are set appropriately. Although Gaussian random variables do not range over a compact set, so Assumption 3 does not hold, we could use truncated Gaussian distributions instead. Such an adjustment does not significantly change the statistics of the generated dataset, since the probability of drawing a value far from the mean is negligible.
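For concreteness, a hypothetical special case of this model (zero instantaneous coupling, A = 0, and a lag-one chain X → Y → Z through B) can be simulated as follows; the coefficients b1, b2 and the noise scales mirror the values used in the DI experiment below, but the code itself is an illustrative sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
b1, b2 = 1.0, 2.0                         # assumed lag-one coupling coefficients
sx, sy, sz = 3.0, 2.0, 1.0                # noise standard deviations

X = sx * rng.normal(size=n)               # X is pure noise
Y = np.zeros(n)
Y[1:] = b1 * X[:-1] + sy * rng.normal(size=n - 1)   # X drives Y with one-step delay
Z = np.zeros(n)
Z[1:] = b2 * Y[:-1] + sz * rng.normal(size=n - 1)   # Y drives Z with one-step delay

# stationary variance of Y is b1^2 * sx^2 + sy^2 = 13 for these values
assert abs(Y[1:].var() - (b1**2 * sx**2 + sy**2)) < 0.5
```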
In the following, we test the capability of our estimator in estimating both conditional mutual information (CMI) and directed information (DI). In both cases, n samples are generated from the model and the batches are constructed according to Algorithms 1 and 2. Then, following Algorithm 3, the joint and product batches are split randomly in half to construct training and evaluation sets. The parameters of the classifier are trained on the training set and the final estimation is computed on the evaluation set (code is available at https://github.com/smolavipour/Neural-Estimatorof-Information-non-i.i.d, accessed on 20 May 2021).
To verify the performance of our technique, we also compare it with the approach taken in [4,31], which is as follows. Conditional mutual information can be computed by subtracting two mutual information terms, i.e.,

I(X; Y|Z) = I(X; Y, Z) − I(X; Z).    (31)
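The identity (31) is just the chain rule for mutual information; a quick discrete check (toy distribution of ours):

```python
import numpy as np

rng = np.random.default_rng(10)
p = rng.random((3, 3, 3)); p /= p.sum()          # toy joint pmf p(x, y, z)

def mi(pab):                                      # mutual information of a 2-D pmf
    pa = pab.sum(1, keepdims=True)
    pb = pab.sum(0, keepdims=True)
    return np.sum(pab * np.log(pab / (pa * pb)))

pz  = p.sum(axis=(0, 1))
pxz = p.sum(axis=1)
pyz = p.sum(axis=0)

# direct CMI: D( p(x,y,z) || p(x|z) p(y,z) )
cmi = np.sum(p * np.log(p * pz[None, None, :] / (pxz[:, None, :] * pyz[None, :, :])))

# MI-diff route, cf. (31): I(X;Y|Z) = I(X; Y,Z) - I(X; Z)
mi_diff = mi(p.reshape(3, 9)) - mi(pxz)

assert np.isclose(cmi, mi_diff)
```

The identity is exact; the practical difference between the two routes lies entirely in the estimation error of each term, which is what the comparison below measures.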
So instead of estimating the CMI term directly, one can use a neural estimator such as the classifier-based estimator in [4] or the MINE estimator [1] to estimate each MI term in (31). In what follows, we refer to this technique as MI-diff, since it computes the difference between two MI terms.

Estimating Conditional Mutual Information
In this scenario, we estimate I(X_1; Y_1|Z_1) for a particular choice of A and B. Then from (30), the CMI can be computed in closed form. Each estimated value is an average of T = 20 estimations, where in each round the batches are re-selected from a fixed dataset. This procedure is repeated for 10 Monte Carlo trials, and the data are re-generated for each trial. The hyper-parameters and settings of the experiment are provided in Table 1. In Figure 3, the CMI is estimated (as Î^{n,T}_DV(X_1; Y_1|Z_1) in Algorithm 3) with n = 2 × 10^4 samples of dimension d = 1 when σ_y = 2, σ_z = 2, and for varying σ_x. It can be observed that the estimator properly estimates the CMI while the variance of the estimation remains small. The latter can be inferred from the shaded region, which indicates the range of estimated CMI for a particular σ_x over all Monte Carlo trials. Next, the experiment is repeated for d = 10 and the results are depicted in Figure 4, where we compare our estimation of the CMI with the MI-diff approach explained in (31), with each MI term estimated by the classifier-based estimator proposed in [4]. It can be observed that the means of both estimators are similar; nonetheless, estimating the CMI directly is more accurate and has less variation than the MI-diff approach. Additionally, our method is faster, since it computes the information term only once, while the MI-diff approach trains two different classifiers, one per MI term.

Estimating Directed Information
DI can explain the underlying causal relationships among processes, a notion with wide applications in various areas. For example, consider a social network where the activities of users are monitored (e.g., the message times, as studied in [23]). The DI between these time-series data expresses how the activity of one user affects the activity of the others. In addition to such data-analytic applications, DI characterizes the capacity of communication channels with feedback, and by estimating the capacity, the rates and powers of transmission can be adjusted in radio communications (see, for example, [32]). In this experiment, consider a network of three processes X, Y, and Z, such that the time-series data are modeled with (30) and:

X_i = N^x_i,  Y_i = b_1 X_{i−1} + N^y_i,  Z_i = b_2 Y_{i−1} + N^z_i.    (33)

In this model, whose relations are depicted in Figure 5, the process X affects Y with a delay, and similarly the signal of Y appears on Z in the next time instance, while independent noise accumulates in both steps. The DIR from X → Y in this network can be computed as follows:

I(X → Y) = (1/2) log(1 + b_1^2 σ_x^2 / σ_y^2).    (34)

Similarly, for the link Y → Z, we have:

I(Y → Z) = (1/2) log(1 + b_2^2 (b_1^2 σ_x^2 + σ_y^2) / σ_z^2).    (35)

Next, we can compute the true DIR for the link X → Z as:

I(X → Z) = (1/2) log(1 + b_1^2 b_2^2 σ_x^2 / (b_2^2 σ_y^2 + σ_z^2)).    (36)

Please note that the DIR corresponding to the other links (i.e., the above links in the reverse direction) is zero by similar computations. Suppose we represent the causal relationships with a directed graph, where a link between two nodes exists if the corresponding DIR is non-zero. Then, according to (34)-(36), the causal relationships are described by the graph of Figure 6a.

Figure 6. Graphical representation of the causal influences between the processes using pairwise directed information (a), and causally conditioned directed information (b).
To estimate the DIR, note that the processes are Markov and the maximum Markov order (o_max) for the set of all processes is o_max = 2 according to (30) and (33). Hence, by (7), we can estimate the DIR with the CMI estimator. For instance, the DIR for the processes (X, Y) can be obtained by Î(X → Y) = Î(Y_i; X_{i-2}^{i} | Y_{i-2}^{i-1}), where the right-hand side is computed similarly to (21). We performed the experiment with n = 2 × 10^5 samples of dimension d = 1, generated according to the model (30) and (33) with b_1 = 1, b_2 = 2, σ_x = 3, σ_y = 2, and σ_z = 1, while the settings of the neural network were chosen as in Table 1. The estimated values are reported in Table 2. It can be seen that the bias of the estimator is fairly small, while the variance of the estimates is negligible. This is in line with the observations in [11] when estimating CMI in the i.i.d. case.
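Under a linear-Gaussian reading of the model above, the dataset for this experiment can be generated as follows. This is a minimal sketch, not the paper's code: the exact displayed form of (30) and (33) is assumed from the surrounding description, and the function name is hypothetical.

```python
import numpy as np

def generate_series(n, b1=1.0, b2=2.0, sx=3.0, sy=2.0, sz=1.0, seed=0):
    """Sample the three coupled processes: X is i.i.d. Gaussian,
    Y follows X with a one-step delay, and Z follows Y with a
    one-step delay, each with independent additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = sx * rng.standard_normal(n)
    Y = np.zeros(n)
    Z = np.zeros(n)
    for i in range(1, n):
        Y[i] = b1 * X[i - 1] + sy * rng.standard_normal()
        Z[i] = b2 * Y[i - 1] + sz * rng.standard_normal()
    return X, Y, Z
```

With the stated parameters, the stationary variances are b_1^2 σ_x^2 + σ_y^2 = 13 for Y and b_2^2 · 13 + σ_z^2 = 53 for Z, which serve as a quick sanity check on the generated data.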

Link  True DIR  Estimation with Our Method (Mean ± Std)
I(X → Y)  0.59  0.57 ± 0.00
I(X → Z)  0.57  0.55 ± 0.00
I(Y → Z)  1.99  1.92 ± 0.01
I(Y → X)  0     0.00 ± 0.00
I(Z → X)  0     0.00 ± 0.00
I(Z → Y)  0     0.00 ± 0.00

Although I(X → Z) > 0, intuitively X affects Z causally only through Y, which suggests that I(X → Z ‖ Y) = 0. This event is referred to as the proxy effect in the study of directed information graphs (see [45]). In fact, the graphical representation of the causal relationships can be simplified using the notion of causally conditioned DIR, as depicted in Figure 6b. To see this formally, note that from (30) and (33), Z_i is independent of the past of X given the past of Y, and hence I(X → Z ‖ Y) = 0. Considering o_max = 2, the causally conditioned DIR terms can be estimated with our CMI estimator according to (7); for instance, Î(X → Y ‖ Z) = Î(Y_i; X_{i-2}^{i} | Y_{i-2}^{i-1}, Z_{i-2}^{i}). The estimation results are provided in Table 3 for all the links, where for each link we averaged over T = 20 estimations (as in Algorithm 3); then, the procedure is repeated for 10 Monte Carlo trials, in each of which we generate a new dataset according to the model.
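As a sanity check on the "True DIR" column, the closed-form values under the assumed linear-Gaussian chain can be evaluated directly. The function below is illustrative (its name and parameterization are assumptions); the variance expressions follow from the chain structure X → Y → Z with independent Gaussian noises.

```python
import math

def true_dir(b1=1.0, b2=2.0, sx2=9.0, sy2=4.0, sz2=1.0):
    """Closed-form DIR for the assumed chain Y_i = b1*X_{i-1} + N_y,
    Z_i = b2*Y_{i-1} + N_z, with X i.i.d. N(0, sx2)."""
    var_y = b1**2 * sx2 + sy2        # stationary variance of Y
    var_z = b2**2 * var_y + sz2      # stationary variance of Z
    d_xy = 0.5 * math.log(1 + b1**2 * sx2 / sy2)
    d_yz = 0.5 * math.log(var_z / sz2)
    # X reaches Z only through Y, so conditioning on the X path still
    # leaves the noise b2*N_y + N_z unexplained:
    d_xz = 0.5 * math.log(var_z / (b2**2 * sy2 + sz2))
    return d_xy, d_yz, d_xz
```

With the parameters of this experiment, these evaluate to approximately 0.59, 1.99, and 0.57, matching the "True DIR" column of Table 2.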

Link  True DIR  Estimation with Our Method (Mean ± Std)
I(X → Y ‖ Z)  0.59  0.57 ± 0.00

In this experiment, we did not explore the effect of higher dimensions for the data, although one should note that for the causally conditioned DIR estimation, even with d = 1 the neural network is fed with inputs of size 9. Nevertheless, the performance of this estimator in higher dimensions with i.i.d. data has been studied in [11], and the challenges of dealing with high dimensions when the data have dependency can be considered a future direction of this work. Additionally, although o_max may not always be known in practice, it can be approximated by data-driven approaches similar to the method described in [45].
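The input size of 9 arises from stacking the target sample with the truncated histories entering the CMI terms. A hypothetical sketch of assembling these lagged inputs follows; the exact slicing convention is an assumption, chosen so that the coordinate count (1 + 3 + 5 = 9 for d = 1) matches the text.

```python
import numpy as np

def lagged_dataset(X, Y, Z, o_max=2):
    """Build (target, cause, condition) arrays for estimating
    I(Y_i; X_{i-o..i} | Y_{i-o..i-1}, Z_{i-o..i}) with Markov order o_max."""
    n = len(Y)
    rows_t, rows_c, rows_s = [], [], []
    for i in range(o_max, n):
        rows_t.append([Y[i]])                        # target Y_i (1 coord)
        rows_c.append(X[i - o_max:i + 1])            # cause history (3 coords)
        cond = np.concatenate([Y[i - o_max:i],       # own past (2 coords)
                               Z[i - o_max:i + 1]])  # side info (3 coords)
        rows_s.append(cond)
    return np.array(rows_t), np.array(rows_c), np.array(rows_s)
```

The three arrays together give the 9-dimensional joint input fed to the neural network for each time index.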

Conclusions and Future Directions
In this paper, we explored the potential of a neural estimator for information measures when there exist time dependencies among the samples. We extended the analysis of the convergence of the estimator and provided experimental results to show its performance in practice. Furthermore, we compared our estimation method with a similar approach taken in [4,31] (which we denoted as MI-diff), and demonstrations on synthetic scenarios show that the variances of our estimates are smaller. However, the main contribution is the derivation of proofs of convergence when the data are generated from a Markov source. Our estimator is based on a k-NN method to re-sample the dataset such that the empirical average over the samples converges to the expectation under a certain density. The convergence result derived for the re-sampling technique is stand-alone and can be adopted in other sampling applications.
Our proposed estimator can potentially be used in the areas of information theory, communication systems, and machine learning. For instance, the capacity of channels with feedback can be characterized with directed information and estimated with our estimator, which can be investigated as a future direction. Furthermore, in machine learning applications where the data have some form of dependency (either spatial or temporal), regularizing the training with information flow requires an estimator of information that captures causality, which is accommodated by our technique. Finally, information measures can be used in modeling and controlling complex systems, and the results in this work can provide meaningful measures such as conditional dependence and causal influence.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Proof of Proposition 1
To show the convergence stated in the Proposition, let us first introduce the following lemma, which is a variant of Birkhoff's ergodic theorem for the case where the samples are not necessarily consecutive.
Lemma A1. Let U^n be n observations of a stationary and ergodic Markov process, where U_i ∈ U and U ⊆ R^d. If E[g(U)] exists and is finite, then

(1/|I_{α,n}|) Σ_{i ∈ I_{α,n}} g(U_i) → E[g(U)]  a.s.,

where I_{α,n} is defined in Definition 2 and the empirical average is taken to be zero when |I_{α,n}| = 0.
Proof. Consider W_1, . . . , W_n generated i.i.d. with W_i ∼ Bernoulli(α). From the definition of I_{α,n}, we can write the summation equivalently as

Σ_{i ∈ I_{α,n}} g(U_i) = Σ_{i=1}^{n} W_i g(U_i).

Since the W_i's are independent of the g(U_i)'s, the pairs (W_i, g(U_i)) are also stationary and ergodic Markov, so from Birkhoff's ergodic theorem, and since E[W g(U)] = α E[g(U)], we obtain

(1/n) Σ_{i=1}^{n} W_i g(U_i) → α E[g(U)]  a.s.  (A4)

On the other hand, from the strong law of large numbers,

|I_{α,n}|/n = (1/n) Σ_{i=1}^{n} W_i → α  a.s.  (A5)

From (A4) and (A5), and since the summation in (A5) is bounded,

(1/|I_{α,n}|) Σ_{i ∈ I_{α,n}} g(U_i) = ((1/n) Σ_{i=1}^{n} W_i g(U_i)) / ((1/n) Σ_{i=1}^{n} W_i) → E[g(U)]  a.s.,

and the proof is complete.
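Lemma A1 can be checked numerically: averaging g(U_i) over a Bernoulli(α)-selected subset of a stationary ergodic Markov chain still converges to E[g(U)]. A minimal sketch, where the AR(1) process and the choice g(u) = u² are assumptions made purely for illustration:

```python
import numpy as np

def subsampled_average(n=200000, alpha=0.3, rho=0.8, seed=1):
    """Average g(U_i) = U_i^2 over a Bernoulli(alpha) subset of a
    stationary AR(1) chain with Var(U) = 1, so E[g(U)] = 1."""
    rng = np.random.default_rng(seed)
    U = np.empty(n)
    U[0] = rng.standard_normal()
    for i in range(1, n):
        U[i] = rho * U[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    W = rng.random(n) < alpha          # i.i.d. Bernoulli(alpha) selection
    return (U**2)[W].mean()            # empirical average over I_{alpha,n}
```

Despite the dependence among the retained samples, the subsampled average approaches 1, as the lemma predicts.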
Using Lemma A1, the proof of Proposition 1 becomes trivial by letting U_i = (X_i, Y_i, Z_i), since the triple is a sample of a jointly stationary and ergodic Markov process. Noting that |I_{α,n}| = B_α^{joint} concludes the proof of the Proposition.

Appendix B. Proof of Proposition 2
To show the convergence of the empirical average over samples in the product batch, we begin by reviewing convergence results for k-NN regression.

Lemma A2 ([39] (Theorem 2-a)). Consider a sequence {(U_i, V_i)}_{i=1}^{n} that is stationary and geometrically φ-mixing (see Definition 4). If k(n)/n → 0 and k(n)/(log n)^2 → ∞, then

sup_u | (1/k(n)) Σ_{j=1}^{k(n)} V_{r_j} − E[V | U = u] | → 0  a.s.,

where r_j denotes the index of the j-th nearest neighbor of u among {U_i}_{i=1}^{n}.

Now, to extend Lemma A2 to the case where the samples are randomly selected for the regression, we show the following lemmas.
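Lemma A2 concerns the plain k-NN regression estimate of E[V | U = u], which simply averages V over the k nearest neighbors of u. A minimal one-dimensional sketch, illustrative only and not the estimator used in the paper:

```python
import numpy as np

def knn_regress(U, V, u, k):
    """Estimate E[V | U = u] by averaging V over the k samples
    whose U values are nearest to u."""
    idx = np.argsort(np.abs(U - u))[:k]   # indices of the k nearest neighbors
    return V[idx].mean()
```

For instance, with V = U² plus noise, the estimate at u = 1 approaches E[V | U = 1] = 1 as the sample size and k grow at the rates required by the lemma.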
Lemma A3. Let {(X_i, Y_i, Z_i)}_{i=1}^{n} be generated under Assumptions 1-3. If k(n)/n → 0 and k(n)/(log n)^2 → ∞, and for any y ∈ Y, E_{P_{X|Z}}[g(X, y, Z) | Z = z] exists and is finite, then we have, for all y:

sup_z | ḡ(y, z) − E_{P_{X|Z}}[g(X, y, Z) | Z = z] | → 0  a.s.,  (A7)

where ḡ(y, z) = (1/k(n)) Σ_{j=1}^{k(n)} g(X_{r_j}, y, z), and r_j refers to the index of the j-th nearest neighbor of z among {Z_i}_{i=1}^{n}.

Proof. The proof follows directly from Lemma A2, as y is fixed in (A7).
Lemma A4. Let {(X_i, Y_i, Z_i)}_{i=1}^{n} be generated under Assumptions 1-3. Then, if k(n) and s(n) fulfill the assumptions in (25), and for any y ∈ Y, E_{P_{X|Z}}[g(X, y, Z) | Z = z] exists and is finite, we have, for all y:

sup_z | ḡ(y, z, W^{s(n)}) − E_{P_{X|Z}}[g(X, y, Z) | Z = z] | → 0  a.s.
Proof. See Appendix C.
where s(n) < n and the convergence occurs according to the random variables Y s(n) , Z s(n) , W s(n) , and the sequence.
and the proof of Lemma A5 is concluded.
Now that the required tools have been introduced, we can continue with the proof of Proposition 2. From Definition 3 and (A9), the LHS of (27) can be expressed as below. Let us define: and from Lemma A5, we obtain that: a.s.
As a result, we can show the following strong convergence: where (A26) holds since s(n) → ∞ by (25). As W is generated independently and the sequence is stationary and ergodic, using Birkhoff's ergodic theorem we have: To complete the proof, note that Therefore, from (A24) and (A29)-(A32), and B_{α′,s}^{prod} = k(n)|I_{α′,s(n)}|, we conclude that: a.s.
and the proof is complete.

Appendix C. Proof of Lemma A4
According to Definition 3, the index set I_{α′,s(n)} is determined by the sequence W^{s(n)}. Therefore, A_{α′,k(n),n,s(n)}(z, Z^n, W^{s(n)}) denotes the set of indices of the k(n) nearest neighbors of z among {Z_i | i ∈ I^c_{α′,s(n)}}, unlike in Lemma A3, where the neighbors can be chosen among the whole sequence {Z_i}_{i=1}^{n}. Hence, the first step is to verify the φ-mixing condition for the isolated k-NN method, where some indices are excluded.
One may expect that the sub-sequence {U_i}_{i ∈ I^c_{α′,s(n)}} is also φ-mixing, since the random jumps make the asymptotic independence (see Definition 4) happen at a faster rate. Nonetheless, we can show that the sequence satisfies Collomb's inequality for some integer t < n and real a_4 such that a_1 a_4 t ≤ 1/4. In order to show a similar inequality for {U_i}_{i ∈ I^c_{α′,s(n)}}, we have that both terms in (A35) are bounded by exponential terms and can be dominated by either of them. Thus, as n → ∞ and s(n) → ∞ (by assumption (25)), both terms tend to zero, and Collomb's inequality applies to the summation over the sub-sequence of samples remaining after the isolation. In other words, the required mixing condition holds for the new sequence {(X_i, Y_i, Z_i)}_{i ∈ I^c_{α′,s(n)}}, and the result in Lemma A2 can be extended to this lemma.
Next, it remains to verify the conditions of Lemma A2 on k(n), which follow from (25). Therefore, the conditions of Lemma A2 hold, and from Lemma A3 it follows that, for all y ∈ Y:

sup_z | ḡ(y, z, W^{s(n)}) − E_{P_{X|Z}}[g(X, y, Z) | Z = z] | → 0  a.s.,
which concludes the proof of the Lemma.