Article

Neural Estimator of Information for Time-Series Data with Dependency

1
School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, 100 44 Stockholm, Sweden
2
Ericsson Research, 164 83 Stockholm, Sweden
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(6), 641; https://doi.org/10.3390/e23060641
Received: 26 April 2021 / Revised: 15 May 2021 / Accepted: 18 May 2021 / Published: 21 May 2021
(This article belongs to the Special Issue Deep Artificial Neural Networks Meet Information Theory)

Abstract

Novel approaches to estimating information measures using neural networks have been widely celebrated in recent years in both the information theory and machine learning communities. These neural estimators have been shown to converge to the true values when estimating mutual information and conditional mutual information from independent samples. However, if the samples in the dataset are not independent, the consistency of these estimators requires further investigation. This is of particular interest for a more complex measure such as directed information, which is pivotal in characterizing causality and is meaningful over time-dependent variables. The extension of the convergence proof to such cases is not trivial and demands further assumptions on the data. In this paper, we show that our neural estimator for conditional mutual information is consistent when the dataset is generated from a stationary and ergodic source. In other words, we show that our information estimator using neural networks converges asymptotically to the true value with probability one. Besides the universal functional approximation of neural networks, a core lemma in showing the convergence is Birkhoff’s ergodic theorem. Additionally, we use the technique to estimate directed information and demonstrate the effectiveness of our approach in simulations.

1. Introduction

In recent decades, tremendous effort has been made to explore the capabilities of feed-forward networks and their applications in various areas. Novel machine learning (ML) techniques go beyond conventional classification and regression tasks and enable revisiting well-known problems in fundamental areas such as information theory. The functional approximation power of neural networks is a compelling tool for estimating information-theoretic quantities such as entropy, KL-divergence, mutual information (MI), and conditional mutual information (CMI). As an example, MI is estimated with neural networks in [1], where numerical results show notable improvements compared to conventional methods for high-dimensional, correlated data.
Information-theoretic quantities are characterized by probability densities, and most classical approaches aim at estimating the densities. These techniques may vary depending on whether the random variables are discrete or continuous. In this paper, we focus on continuous random variables. Examples of conventional non-parametric methods to estimate these quantities are histogram and partitioning techniques, where the densities are approximated and plugged into the definitions of the quantities, or methods based on the distance of the k-th nearest neighbor [2]. Despite the vast applications of nearest neighbor methods for the estimation of information-theoretic quantities, such as the technique proposed in [3], recent studies advocate using neural networks, and simulations demonstrate that the accuracy of the estimation improves in several scenarios [1,4]. In particular, the results indicate that as the dimension of the data increases, the bias of the estimation deteriorates less with neural estimators. In addition to superior performance, a neural estimator of information can be treated as a stand-alone block and coupled into a larger network. The estimator can then be trained simultaneously with the rest of the network and measure the flow of information among variables of the network. Therefore, it facilitates the implementation of ML setups with constraints on information measures (e.g., information bottleneck [5] and representation learning [6]). These compelling features motivate exploring the benefits of neural networks to estimate other information measures and more complex data structures.
The cornerstone of neural estimators for MI is to approximate bounds on the relative entropy instead of computing it directly. These bounds are referred to as variational bounds and recently have gained attention due to their applications in ML problems. Examples are the lower bounds proposed originally in [7] by Donsker and Varadhan, and in [8] by Nguyen, Wainwright, and Jordan that are referred to as DV bound and NWJ bound, respectively. Several variants of these bounds have been reviewed in [9]. Variational bounds are tight, and the estimators proposed in [1,4,10,11] leverage this property and use neural networks to approximate the bounds and correspondingly the desired information measure. These estimators were shown to be consistent (i.e., the estimation converges asymptotically to the true value) and suitably estimate MI and CMI when the samples are independently and identically distributed (i.i.d.). However, in several applications such as time series analysis, natural language processing, or estimating information rates in communication channels with feedback, there exists a dependency among samples in the data. In this paper, we investigate analytically the convergence of our neural estimator and verify the performance of the method in estimating several information quantities.
Consider several random processes whose realizations are dependent in time. In addition to common information-theoretic measures such as MI and CMI, more complex quantities can be studied that are paramount in representing these processes. For instance, the (temporal) causal relationship between two random processes has been expressed with quantities such as directed information (DI) [12,13] and transfer entropy (TE) [14]. Both DI and TE have a variety of applications in different areas. In communication systems, DI characterizes the capacity of a channel with feedback [15], while it has several other applications in areas including portfolio theory [16], source coding [17], and control theory [18], where DI is exploited as a measure of privacy in a cloud-based control setup. Additionally, DI was introduced as a measure of causal dependency in [19], which led to a series of works in that direction with applications in neuroscience [20,21] and social networks [22,23]. TE is also a well-celebrated measure in neuroscience [24,25] and in the physics community [26,27] for quantifying causality in time series. In this paper, we investigate the capability of the neural estimator proposed in [11] when the samples in the data are not generated independently.
Conventional approaches to estimate KL-divergence and MI, such as nearest neighbor methods, can be used for non-i.i.d. data; for example, to estimate DI [28] and TE [29,30]. However, it is possible to leverage the benefits of neural estimators highlighted in [1] even when the data are generated from a source with dependency among its realizations. In a recent work [31], the authors estimate TE using the neural estimator for CMI introduced in [4]. Additionally, recurrent neural networks (RNN) are proposed in [32] to capture the time dependency to estimate DI. However, showing convergence of these estimators requires further theoretical investigation. Although the neural estimators are shown to be consistent in [1,4,11] for i.i.d. data, the extension of the proofs to dependent data needs to be addressed. In [32], the authors address the consistency of the estimation of DI by referring to the universal approximation of RNN [33] and Breiman’s ergodic theorem [34]. Because RNNs are more complicated to implement and tune, in this paper we use simple feed-forward neural networks, as were also proposed in [1,4,11]. A conventional step to go beyond i.i.d. processes is to investigate stationary and ergodic Markov processes, which have numerous applications in modeling real-world systems. Many convergence results for i.i.d. data, such as the law of large numbers, can be extended to ergodic processes; however, this generalization is not always trivial. The estimator proposed in [11] exhibits major improvements in estimating the CMI. Nevertheless, it is based on a k-nearest neighbors (k-NN) sampling technique, which makes the extension of the convergence proofs to non-i.i.d. data more involved. The main contribution of this paper is to provide convergence results and consistency proofs for this neural estimator when the data are stationary and ergodic Markov.
The paper is organized as follows. Notations and basic definitions are introduced in Section 2. Then, in Section 3, the neural estimator and procedures are explained. Additionally, the convergence of the estimator is studied when the data are generated from a Markov source. Next, we provide simulation results in Section 4 for synthetic scenarios and verify the effectiveness of our technique in estimating CMI and DI. Finally, we conclude the paper in Section 5 and suggest potential future directions.

2. Preliminaries

We begin by describing the notation used throughout the paper, and the main definitions are explained afterwards. Then we review variational bounds which are the basis of our neural estimator.

2.1. Notation

Random variables and their realizations are denoted by capital and lower case letters, respectively. Given two integers $i$ and $j$, a sequence of random variables $X_i, X_{i+1}, \ldots, X_j$ is shown as $X_i^j$, or simply $X^j$ when $i = 1$. For a stochastic process $\mathbf{Z}$, a randomly generated sample is denoted by the random variable $Z$. We indicate sets with calligraphic notation (e.g., $\mathcal{X}$). The space of $d$-dimensional real vectors is shown as $\mathbb{R}^d$. The probability density function (PDF) of a random variable $X$ at $X = x$ is denoted by $p_X(x)$ or equivalently $p(x)$, and the distribution of $X$ by $P_X$ or simply $P$. The PDF of multiple random variables $X_1, \ldots, X_i$ is $p_{X_1 \cdots X_i}(x_1, \ldots, x_i)$, and for simplicity it is represented by $p(x_1, \ldots, x_i)$ in the paper. For the distribution $P$, $\mathbb{E}_P[\cdot]$ denotes the expectation with respect to its density $p(\cdot)$. All logarithms are in base $e$.
The almost sure convergence (or convergence with probability one) of the sequence $X_n$ to $X$ is denoted by $X_n \xrightarrow{a.s.} X$ and is defined as:
$$P\left(\lim_{n \to \infty} X_n = X\right) = 1.$$

2.2. Information Measures

The information-theoretic quantities of interest for this work can be written in terms of a KL-divergence, and the available neural estimators originally aim to estimate this quantity. For a random variable $X$ with support $\mathcal{X} \subseteq \mathbb{R}^d$, the KL-divergence between two PDFs $p(x)$ and $q(x)$ is defined as:
$$D(p(x) \,\|\, q(x)) := \mathbb{E}_P\left[\log \frac{p(X)}{q(X)}\right]. \tag{1}$$
Then, CMI can be defined using the KL-divergence as below:
$$I(X;Y|Z) := D\big(p(x,y,z) \,\|\, p(x|z)\,p(y,z)\big), \tag{2}$$
where $Y$ and $Z$ are random variables with support on $\mathcal{Y}$ and $\mathcal{Z}$, which are subsets of $\mathbb{R}^d$. In this paper, we focus on extending the estimators for CMI to non-i.i.d. data, where samples in time-series data might not be independently and identically distributed (e.g., generated from a Markov process); nonetheless, our method and consistency proofs are fairly general and can be applied to estimating the KL-divergence as well. Consider a sequence of random samples $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ generated from the joint process $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$, where the samples are not necessarily i.i.d. A simple step toward this extension is to verify that the previous neural estimators, e.g., [11], can be used to estimate $I(X;Y|Z)$, where $(X,Y,Z) \sim p(x,y,z)$ and the processes $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ are Markov, as in the following assumption.
Assumption 1.
$(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ are jointly stationary and ergodic first-order Markov with marginal density $p(x,y,z)$. The extension of the results to $d$-th order Markov is straightforward.
To further generalize the neural estimators, it is possible to investigate their capability for information measures that rely on dependent random variables. Consider the pairs $\{(X_i, Y_i)\}_{i=1}^{n}$ to be samples of the processes $(\mathbf{X}, \mathbf{Y})$. If the generated samples are dependent in time, it is possible to measure the causal relationship between the processes with quantities such as DI and TE, defined as below:
$$I(X^n \to Y^n) := \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}), \tag{3}$$
$$T_{X \to Y}(i) := I(X_{i-J}^{i-1}; Y_i | Y_{i-L}^{i-1}), \tag{4}$$
where $J$ and $L$ are parameters of the TE that determine the length of memory to consider for $\mathbf{X}$ and $\mathbf{Y}$, respectively. Both quantities are functions of the CMI, and Figure 1 visualizes the corresponding variables in each CMI term for DI and TE. In particular, each CMI term in (3) quantifies the amount of shared information between $X^i$ and $Y_i$ conditioned on $Y^{i-1}$, i.e., it excludes the effect of the causal history of $\mathbf{Y}$. In a general form, to express the causal effect of the process $\mathbf{X}$ on $\mathbf{Y}$ conditioned causally on $\mathbf{Z}$, the DI is normalized by $n$; the resulting quantity, defined below, is called the directed information rate (DIR):
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) := \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n \,\|\, Z^n) = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1}, Z^i). \tag{5}$$
By assuming the processes to be Markov, (5) can be simplified (see [23,35,36]). To be explicit, if both $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ and $(\mathbf{Y}, \mathbf{Z})$ are stationary and ergodic first-order Markov, from (5) the DIR can be simplified as:
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = I(X^2; Y_2 | Y_1, Z^2), \tag{6}$$
where the CMI is with respect to the stationary density $p(x^2, y^2, z^2)$ of the Markov model. To generalize this approach, let us define the maximum Markov order ($o_{\max}$) of a set of processes to be the minimum number $o$ such that the Markov order of the joint random variables of any subset of the processes is less than or equal to $o$. So if $o_{\max} = l$ for $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$, then from (5) we can simplify the DIR term as:
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = I(X^{l+1}; Y_{l+1} | Y^{l}, Z^{l+1}). \tag{7}$$
The following example shows how the DIR can be computed for a linear data model, and emphasizes the difference when the DIR is conditioned causally on another process.
Example 1.
Consider the following linear model where $\{W_i^x\}_{i=1}^{\infty}$, $\{W_i^y\}_{i=1}^{\infty}$, and $\{W_i^z\}_{i=1}^{\infty}$ are uncorrelated white Gaussian noises with variances $\sigma_x^2$, $\sigma_y^2$, and $\sigma_z^2$, respectively:
$$\begin{aligned} X_i &= W_i^x \\ Y_i &= a\,Y_{i-1} + Z_{i-1} + W_i^y \\ Z_i &= X_i + W_i^z \end{aligned}$$
for some $|a| < 1$, and $(X_0, Y_0, Z_0)$ are distributed according to the stationary distribution of the processes $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. This model satisfies Assumption 1 and $o_{\max} = 1$, so $I(\mathbf{X} \to \mathbf{Y})$ can be computed as:
$$I(\mathbf{X} \to \mathbf{Y}) = I(X_1^2; Y_2 | Y_1) = \frac{1}{2} \log\left(1 + \frac{\sigma_x^2}{\sigma_y^2 + \sigma_z^2}\right),$$
while from (7):
$$I(\mathbf{X} \to \mathbf{Y} \,\|\, \mathbf{Z}) = I(X_1^2; Y_2 | Y_1, Z_1^2) = 0.$$
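The two closed-form values in Example 1 can be checked numerically. The sketch below is not the neural estimator of this paper: it assumes the variables are jointly Gaussian, so that each CMI reduces to a ratio of linear-regression residual variances. The function names, the choice $a = 0.5$, the unit noise variances, and the burn-in length are all illustrative assumptions.

```python
import numpy as np

def simulate_example1(n, a=0.5, sx=1.0, sy=1.0, sz=1.0, burn_in=1000, seed=0):
    """Simulate X_i = W^x_i, Y_i = a*Y_{i-1} + Z_{i-1} + W^y_i, Z_i = X_i + W^z_i."""
    rng = np.random.default_rng(seed)
    total = n + burn_in
    x = sx * rng.standard_normal(total)
    z = x + sz * rng.standard_normal(total)
    wy = sy * rng.standard_normal(total)
    y = np.zeros(total)
    for i in range(1, total):
        y[i] = a * y[i - 1] + z[i - 1] + wy[i]
    # Drop the burn-in so that the retained samples are close to stationary.
    return x[burn_in:], y[burn_in:], z[burn_in:]

def residual_variance(target, regressors):
    """Variance of target after least-squares regression on the regressors."""
    design = np.column_stack(list(regressors) + [np.ones_like(target)])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

def gaussian_cmi(target, extra, cond):
    """I(extra; target | cond) in nats, valid for jointly Gaussian variables."""
    return 0.5 * np.log(residual_variance(target, cond)
                        / residual_variance(target, cond + extra))

sx = sy = sz = 1.0  # assumed noise standard deviations
x, y, z = simulate_example1(200_000, sx=sx, sy=sy, sz=sz)
x1, x2 = x[:-1], x[1:]
y1, y2 = y[:-1], y[1:]
z1, z2 = z[:-1], z[1:]

dir_xy = gaussian_cmi(y2, [x1, x2], [y1])                  # I(X_1^2; Y_2 | Y_1)
dir_xy_given_z = gaussian_cmi(y2, [x1, x2], [y1, z1, z2])  # I(X_1^2; Y_2 | Y_1, Z_1^2)
closed_form = 0.5 * np.log(1 + sx**2 / (sy**2 + sz**2))
print(dir_xy, closed_form, dir_xy_given_z)
```

With these settings the first estimate should land close to the closed-form value, while the estimate conditioned on $\mathbf{Z}$ should be close to zero, since $Z_1$ already carries all the influence of $X_1$ on $Y_2$.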
As emphasized earlier, (7) holds when $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ and $(\mathbf{Y}, \mathbf{Z})$ are Markov with order $l$. Then the CMI estimators can potentially be used to estimate the DIR. However, the consistency of the estimation still needs to be investigated since the samples are not independent. Before introducing our technique, we review the basics of estimating information measures with neural networks.

2.3. Estimating the Variational Bound

The estimators proposed in [1,4,11] are all based on tight lower bounds on the KL-divergence, such as the DV bound, introduced in [7]:
$$D(p(x) \,\|\, q(x)) \ge \sup_{f \in \mathcal{F}} \; \mathbb{E}_P\big[f(X)\big] - \log \mathbb{E}_Q\big[\exp(f(X))\big], \tag{9}$$
where $p$ and $q$ are two PDFs defined over $\mathcal{X}$ with corresponding distributions $P$ and $Q$, respectively, and $\mathcal{F}$ is any class of functions $f : \mathcal{X} \to \mathbb{R}$ such that the two expectations exist and are finite. Consider a neural network with parameters $\theta \in \Theta$; then $\mathcal{F}$ can be the class of all functions constructed with this neural network by choosing different values for the parameters $\theta$. In more detail, let $f_\theta(x)$ be the end-to-end function of a neural network with parameters $\theta \in \Theta$; then the optimization on the right-hand side (RHS) of (9) is equivalent to optimizing over $\Theta$ (as performed in [1]). Nevertheless, we can leverage the fact that the DV bound is tight when the function is chosen as:
$$f^*(x) = \log \frac{p(x)}{q(x)} \quad \forall x \in \mathcal{X}. \tag{10}$$
Thus, the neural network can approximate $f^*(x)$ directly and the lower bound can be computed accordingly (as performed in [4,11]).
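The tightness of the DV bound at $f^*$ can be illustrated with a small Monte Carlo check. This is only a sketch under assumed distributions: $p = \mathcal{N}(1,1)$ and $q = \mathcal{N}(0,1)$ are chosen here because the KL-divergence ($0.5$ nats) and $f^*(x) = x - 0.5$ are available in closed form; the helper name `dv_bound` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
xp = rng.normal(1.0, 1.0, n)  # samples from p = N(1, 1)
xq = rng.normal(0.0, 1.0, n)  # samples from q = N(0, 1)

def dv_bound(f, xp, xq):
    """Empirical DV bound: mean of f under p minus log-mean-exp of f under q."""
    return np.mean(f(xp)) - np.log(np.mean(np.exp(f(xq))))

def f_star(x):
    # log p(x)/q(x) for the two unit-variance Gaussians above
    return x - 0.5

def f_other(x):
    # an arbitrary suboptimal choice, for comparison
    return 0.5 * x

true_kl = 0.5                          # (1 - 0)^2 / 2 for these Gaussians
dv_star = dv_bound(f_star, xp, xq)     # should be close to true_kl
dv_other = dv_bound(f_other, xp, xq)   # should be strictly smaller (0.375 in expectation)
print(dv_star, dv_other, true_kl)
```

The suboptimal function yields a valid but loose lower bound, which is why the quality of the approximation of $f^*$ directly controls the quality of the estimate.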
Definition 1.
For the PDFs $p(x,y,z)$ and $p(x|z)\,p(y,z)$, define the corresponding distributions on $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$ to be $\tilde{P}$ and $\tilde{Q}$, respectively.
Since the CMI can be stated as a KL-divergence (2), the DV bound can be defined for CMI as below:
$$I(X;Y|Z) \ge \sup_{f \in \mathcal{F}} \; \mathbb{E}_{\tilde{P}}\big[f(X,Y,Z)\big] - \log \mathbb{E}_{\tilde{Q}}\big[\exp(f(X,Y,Z))\big], \tag{11}$$
and the bound is tight by choosing
$$f^*(x,y,z) = \log \frac{p(x,y,z)}{p(x|z)\,p(y,z)} \quad \forall (x,y,z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}. \tag{12}$$
The main barrier to computing this bound for $f^*(x,y,z)$ is that the densities are unknown. This challenge is addressed in [4,11] by proposing neural classifiers that can approximate $f^*(x,y,z)$ without knowing the densities. Below we review the steps of the estimation technique provided in [11]:
(1) Construct the joint batch, containing samples generated according to $p(x,y,z)$.
(2) Construct the product batch, containing samples generated according to $p(x|z)\,p(y,z)$.
(3) Train the neural network with a particular loss function, which we explain later, to approximate $f^*(x,y,z)$, i.e., the logarithm of the density ratio $\frac{p(x,y,z)}{p(x|z)\,p(y,z)}$.
(4) Compute (11) using the batches and the approximated function.
To show the consistency of the estimation with this approach, it is crucial to verify that the empirical average with respect to each sample batch converges asymptotically to the corresponding expectation. Additionally, the neural network should be designed and trained to be capable of approximating the density ratio. For i.i.d. data samples, the authors in [4,11] provided the proofs in the form of concentration bounds. In this paper, we extend these proofs to non-i.i.d. data by providing convergence results for the special case of stationary and ergodic Markov processes. In the remainder of the paper, we denote the data by $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$, which are consecutive samples of the stationary Markov processes $(\mathbf{X}, \mathbf{Y}, \mathbf{Z})$ with marginal PDF $p(x,y,z)$.

3. Main Results

In this section, we describe our proposed neural estimator in detail. To create the batches, the estimator is equipped with a k-NN sampling block such that the empirical average over the samples converges to the expected mean. Next, we describe the roadmap to show the convergence of the estimation to the true value (i.e., consistency analysis).

3.1. Batch Construction

To create the joint batch it is sufficient to take $(X_i, Y_i, Z_i)$ randomly from the available data. Below we define the joint batch formally using an auxiliary random variable that indicates whether an instance is selected or not (see also Algorithm 1 for the implementation).
Algorithm 1: Construction of the joint batch
Definition 2 (Joint batch).
Let $W_i \sim \mathrm{Ber}(\alpha)$ for $i = 1, \ldots, n$ be independent random variables, and $\mathcal{I}_{\alpha,n}(W^n) := \{i \mid i \in \{1, \ldots, n\},\ W_i = 1\}$. Then $\mathcal{B}_{\mathrm{joint}}^{\alpha}$ is defined as
$$\mathcal{B}_{\mathrm{joint}}^{\alpha} := \{(X_i, Y_i, Z_i) \mid i \in \mathcal{I}_{\alpha,n}\},$$
where we use $\mathcal{I}_{\alpha,n}$ to simplify the notation.
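Definition 2 can be sketched in a few lines. The function name `joint_batch_indices` and the placeholder data are assumptions made for illustration, and indices are 0-based here rather than 1-based as in the definition.

```python
import numpy as np

def joint_batch_indices(n, alpha, rng):
    """Draw W_i ~ Ber(alpha) independently and return {i : W_i = 1} (0-based)."""
    w = rng.random(n) < alpha
    return np.flatnonzero(w)

rng = np.random.default_rng(0)
n, alpha = 10_000, 0.3
data = rng.standard_normal((n, 3))  # placeholder rows (x_i, y_i, z_i)

idx = joint_batch_indices(n, alpha, rng)
joint_batch = data[idx]             # the tuples (X_i, Y_i, Z_i) with i in the index set
print(len(joint_batch), alpha * n)
```

Each sample is kept with probability $\alpha$ independently of the data, so the batch size concentrates around $\alpha n$.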
Please note that by the law of large numbers, the length of the joint batch is asymptotically $\alpha n$. Next, to construct the product batch we use the method based on the k-NN technique, which is introduced in [11]. Below we define our method, denoted the isolated k-NN technique, and explain how the product batch is constructed (see also Algorithm 2).
Algorithm 2: Construction of the product batch
Definition 3 (Product batch).
For $s < n$, let $W_i \sim \mathrm{Bernoulli}(\alpha)$ for $i = 1, \ldots, s$ be independent random variables, and
$$\mathcal{I}_{\alpha,s}(W^s) := \{i \mid i \in \{1, \ldots, s\},\ W_i = 1\} \quad \text{and} \quad \mathcal{I}_{\alpha,s}^{c}(W^s) := \{1, \ldots, n\} \setminus \mathcal{I}_{\alpha,s}(W^s).$$
Then for any $\zeta \in \mathcal{Z}$ and given the data $\{(x_i, y_i, z_i)\}_{i=1}^{n}$, define $\mathcal{A}_{\alpha,k,n,s}(\zeta, z^n, w^s)$ as the set of indices of the $k$ nearest neighbors of $\zeta$ (by Euclidean distance) among $\{z_i\}$ for $i \in \mathcal{I}_{\alpha,s}^{c}(w^s)$. Formally, let $\pi : \{1, \ldots, n - |\mathcal{I}_{\alpha,s}(W^s)|\} \to \mathcal{I}_{\alpha,s}^{c}(W^s)$ be a bijection such that $\|\zeta - z_{\pi(1)}\|_2 \le \cdots \le \|\zeta - z_{\pi(n - |\mathcal{I}_{\alpha,s}(W^s)|)}\|_2$. Then, $\mathcal{A}_{\alpha,k,n,s}(\zeta, z^n, w^s) := \{\pi(1), \ldots, \pi(k)\}$. So the product batch can be defined as:
$$\mathcal{B}_{\mathrm{prod}}^{\alpha,s} := \big\{(X_{j(i)}, Y_i, Z_i) \mid i \in \mathcal{I}_{\alpha,s}(W^s),\ j(i) \in \mathcal{A}_{\alpha,k,n,s}(Z_i, Z^n, W^s)\big\}.$$
Hereafter we use $\mathcal{I}_{\alpha,s}$, $\mathcal{I}_{\alpha,s}^{c}$, and $\mathcal{A}_{\alpha}(\zeta)$ instead, as the remaining parameters can be understood from the context. We refer to this sampling technique as isolated k-NN in the sequel. An example is also provided in Figure 2 for the case of $k = 2$.
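The isolated k-NN construction of Definition 3 can be sketched as follows. This is a brute-force illustration, not the authors' implementation: the function names, the toy data, and the 0-based indexing are assumptions, and the pool of candidate neighbors is assumed to contain at least $k$ points.

```python
import numpy as np

def product_batch_indices(z, alpha, k, s, rng):
    """Return (isolated, neighbors): isolated holds the selected indices among the
    first s samples; neighbors[r] holds the k nearest non-isolated indices to
    z[isolated[r]] in Euclidean distance."""
    n = len(z)
    z = np.asarray(z, dtype=float).reshape(n, -1)
    isolated = np.flatnonzero(rng.random(s) < alpha)   # subset of the first s indices
    pool = np.setdiff1d(np.arange(n), isolated)        # all n indices except the isolated ones
    neighbors = np.empty((len(isolated), k), dtype=int)
    for row, i in enumerate(isolated):
        dist = np.linalg.norm(z[pool] - z[i], axis=1)
        neighbors[row] = pool[np.argsort(dist)[:k]]
    return isolated, neighbors

def product_batch(x, y, z, isolated, neighbors):
    """Pair each isolated (Y_i, Z_i) with X_j taken from each of its k neighbors."""
    k = neighbors.shape[1]
    xb = x[neighbors.ravel()]
    yb = np.repeat(y[isolated], k, axis=0)
    zb = np.repeat(z[isolated], k, axis=0)
    return xb, yb, zb

rng = np.random.default_rng(1)
n, s, k, alpha = 400, 100, 2, 0.5
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)
y = z + rng.standard_normal(n)

isolated, neighbors = product_batch_indices(z, alpha, k, s, rng)
xb, yb, zb = product_batch(x, y, z, isolated, neighbors)
print(len(isolated), len(xb))
```

Because the neighbor's $z_j$ is close to $z_i$, the borrowed $x_j$ behaves approximately like a draw from $p(x|z_i)$ that is decoupled from $y_i$, which is what the product density $p(x|z)\,p(y,z)$ requires.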
Remark 1.
Here we emphasize that the isolated indices are selected from the first $s$ indices of samples, while the neighbors can be searched among all $n$ indices of data except the ones in $\mathcal{I}_{\alpha,s}(w^s)$. Additionally, note that the length of the product batch is $\alpha s k$ asymptotically as $n \to \infty$, because $s k$ also tends to $\infty$, as we see later in the assumptions of Proposition 2.

3.2. Training the Classifier

As explained earlier, the optimal function for a tight lower bound on the CMI is obtained from the density ratio, and to compute it we use the functional approximation power of neural networks. Consider a feedforward neural network whose last layer is equipped with the sigmoid function. The network is parameterized with $\theta \in \Theta \subseteq \mathbb{R}^h$, where $h$ is the number of parameters, and the neural network function is denoted by $\omega_\theta : \mathcal{X} \times \mathcal{Y} \times \mathcal{Z} \to [0,1]$. For an input $(X,Y,Z)$ of the network, let $C \in \{0,1\}$ denote the class of the input, which indicates whether the tuple is generated according to $p(x,y,z)$ or $p(x|z)\,p(y,z)$. To be explicit, the input is either picked from the joint batch (class $C = 1$) or the product batch (class $C = 0$), and the goal is to learn the network parameters such that the network can distinguish the class of new (unseen) queries. Let the loss function be the binary cross-entropy function. So for $\omega$ being any function with inputs $(x,y,z)$ and range $[0,1]$, the expected loss is defined as:
$$L(\omega) := -\mathbb{E}\big[C \log \omega(X,Y,Z) + (1 - C) \log\big(1 - \omega(X,Y,Z)\big)\big]. \tag{15}$$
It is well-established that by minimizing $L(\omega)$, the solution $\omega^*$ represents the probability of classifying the input in the class $C = 1$ given the input data, i.e., $P(C = 1 | x,y,z)$. In fact, as shown in [11] (Lemma 1), if the prior distribution on the classes is unbiased, by taking the derivative in (15) we have:
$$\Gamma(x,y,z) = \frac{p(x,y,z)}{p(x|z)\,p(y,z)} = \frac{\omega^*(x,y,z)}{1 - \omega^*(x,y,z)}. \tag{16}$$
So from (12) the optimal function can be expressed with $\Gamma(x,y,z)$ as:
$$f^*(x,y,z) = \log \Gamma(x,y,z) \quad \forall (x,y,z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}. \tag{17}$$
Therefore, by training the neural network, we can approximate the optimal function $f^*(x,y,z)$ and estimate the lower bound for CMI.
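The step from (15) to (16) can be spelled out as follows. This is a sketch of the standard pointwise argument, assuming balanced classes, $P(C=1) = P(C=0) = \tfrac{1}{2}$, and writing $\tilde{p}$ and $\tilde{q}$ as shorthand (introduced here) for the densities $p(x,y,z)$ and $p(x|z)\,p(y,z)$.

```latex
\begin{aligned}
L(\omega) &= -\tfrac{1}{2} \int \tilde{p} \, \log \omega \; dx\,dy\,dz
             \;-\; \tfrac{1}{2} \int \tilde{q} \, \log(1 - \omega) \; dx\,dy\,dz, \\[4pt]
0 &= \frac{\partial}{\partial \omega}\Big[ -\tilde{p} \log \omega - \tilde{q} \log(1 - \omega) \Big]
   = -\frac{\tilde{p}}{\omega} + \frac{\tilde{q}}{1 - \omega}
   \quad\Longrightarrow\quad
   \omega^* = \frac{\tilde{p}}{\tilde{p} + \tilde{q}}, \\[4pt]
\frac{\omega^*}{1 - \omega^*} &= \frac{\tilde{p}}{\tilde{q}} = \Gamma(x,y,z).
\end{aligned}
```

The integrand is convex in $\omega$ at every point $(x,y,z)$, so the stationary point is the pointwise minimizer, and $\omega^*$ coincides with the posterior $P(C = 1 | x,y,z)$ under the balanced prior.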
Consider the neural network $\omega_\theta$; then the empirical loss function is defined as:
$$L_{\mathrm{emp}}(\omega_\theta) := -\frac{1}{2\,|\mathcal{B}_{\mathrm{joint}}^{\alpha}|} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{joint}}^{\alpha}} \log \omega_\theta(X,Y,Z) \;-\; \frac{1}{2\,|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} \log\big(1 - \omega_\theta(X,Y,Z)\big), \tag{18}$$
and the optimal parameters are obtained by solving the following problem:
$$\hat{\theta} := \arg\min_{\theta} L_{\mathrm{emp}}(\omega_\theta). \tag{19}$$
Consequently, we can approximate the density ratio $\Gamma(x,y,z)$ from (16):
$$\hat{\Gamma}(x,y,z) = \frac{\omega_{\hat{\theta}}(x,y,z)}{1 - \omega_{\hat{\theta}}(x,y,z)}. \tag{20}$$
To avoid boundary values (i.e., $\omega_{\hat{\theta}}(x,y,z)$ close to 0 or 1), the output of the neural network is clipped to $[\tau, 1 - \tau]$ for some small $\tau > 0$.
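The empirical loss and the clipped density-ratio estimate can be sketched independently of any particular network. In the snippet below the classifier outputs are replaced by a known optimum for a toy one-dimensional problem (class 1 drawn from $\mathcal{N}(1,1)$, class 0 from $\mathcal{N}(0,1)$, for which $\omega^*(x)$ is a sigmoid of $x - 0.5$); the function names and the default $\tau$ are assumptions for illustration.

```python
import numpy as np

def empirical_loss(omega_joint, omega_prod):
    """Balanced binary cross-entropy over classifier outputs on the two batches."""
    return (-0.5 * np.mean(np.log(omega_joint))
            - 0.5 * np.mean(np.log(1.0 - omega_prod)))

def density_ratio(omega, tau=1e-3):
    """Clip classifier outputs to [tau, 1 - tau], then map them to omega / (1 - omega)."""
    omega = np.clip(omega, tau, 1.0 - tau)
    return omega / (1.0 - omega)

def omega_star(x):
    # Optimal classifier p / (p + q) for p = N(1, 1) and q = N(0, 1) (toy assumption).
    return 1.0 / (1.0 + np.exp(-(x - 0.5)))

rng = np.random.default_rng(0)
xp = rng.normal(1.0, 1.0, 50_000)  # stands in for the joint batch (class 1)
xq = rng.normal(0.0, 1.0, 50_000)  # stands in for the product batch (class 0)

loss_opt = empirical_loss(omega_star(xp), omega_star(xq))
loss_chance = empirical_loss(np.full_like(xp, 0.5), np.full_like(xq, 0.5))  # equals log 2

grid = np.linspace(-3.0, 3.0, 7)
ratio_on_grid = density_ratio(omega_star(grid))  # should match exp(grid - 0.5) = p/q
print(loss_opt, loss_chance)
```

A trained network would supply `omega_joint` and `omega_prod` in place of `omega_star`; the clipping only changes the ratio where the classifier is nearly certain.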
Remark 2.
Please note that $\hat{\Gamma}(x,y,z)$ approximates the density ratio if the batch sizes $|\mathcal{B}_{\mathrm{joint}}^{\alpha}|$ and $|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|$ are balanced. Otherwise, (20) requires a correction coefficient (see [11]). To fulfill this, given the number of samples $n$, one can choose the parameters such that $\alpha n = \alpha s k$. Then, by the law of large numbers, the batches will asymptotically be balanced.

3.3. Estimation of the DV Bound

The final step in the estimation of CMI is to compute the lower bound (11) empirically using $\hat{\Gamma}(x,y,z)$. So by substituting the expectations with empirical averages with respect to samples in the joint and the product batch, the CMI estimator is defined as:
$$\hat{I}_{DV}^{n}(X;Y|Z) := \frac{1}{|\mathcal{B}_{\mathrm{joint}}^{\alpha}|} \sum_{(x,y,z) \in \mathcal{B}_{\mathrm{joint}}^{\alpha}} \log \hat{\Gamma}(x,y,z) \;-\; \log \frac{1}{|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|} \sum_{(x,y,z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} \hat{\Gamma}(x,y,z). \tag{21}$$
In practice, to mitigate the inaccuracy induced by sampling from the original data, the training and estimation are repeated for several sampling trials. The steps for implementing the estimator are described in Algorithm 3. In the next part, we provide the convergence results for our estimator to validate the substitution of the expectations in (11) with empirical averages with respect to the joint and the product batch. Then we show the convergence of the overall estimation to the true CMI value.
Algorithm 3: Estimation of CMI
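The empirical DV estimate in (21) reduces to two averages over density-ratio values. The sketch below evaluates it with an oracle ratio in place of the trained $\hat{\Gamma}$, on an assumed toy Gaussian model where the true CMI is known: $Z \sim \mathcal{N}(0,1)$, $X = Z + U$, $Y = Z + V$, with $(U,V)$ standard bivariate normal with correlation $\rho$, so that $I(X;Y|Z) = -\tfrac{1}{2}\log(1-\rho^2)$. The function names and the model are illustrative assumptions.

```python
import numpy as np

def dv_estimate(gamma_joint, gamma_prod):
    """Mean log-ratio over the joint batch minus log of the mean ratio over the product batch."""
    return np.mean(np.log(gamma_joint)) - np.log(np.mean(gamma_prod))

def oracle_ratio(x, y, z, rho):
    """Exact p(x,y,z) / (p(x|z) p(y,z)) for the toy Gaussian model described above."""
    u, v = x - z, y - z
    expo = -(rho**2 * (u**2 + v**2) - 2.0 * rho * u * v) / (2.0 * (1.0 - rho**2))
    return np.exp(expo) / np.sqrt(1.0 - rho**2)

rng = np.random.default_rng(0)
n, rho = 200_000, 0.5
z = rng.standard_normal(n)
u = rng.standard_normal(n)
v = rho * u + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
x, y = z + u, z + v                   # samples from p(x, y, z)
x_prod = z + rng.standard_normal(n)   # X drawn from p(x|z), independent of Y given Z

cmi_hat = dv_estimate(oracle_ratio(x, y, z, rho),
                      oracle_ratio(x_prod, y, z, rho))
cmi_true = -0.5 * np.log(1.0 - rho**2)
print(cmi_hat, cmi_true)
```

In the actual estimator the product samples come from the isolated k-NN batch and the ratio from the trained classifier, so the remaining error is governed by how well those two approximations hold.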

3.4. Consistency Analysis

The consistency of our neural estimator (i.e., showing that the estimator converges to the true value) is based on the universal functional approximation power of neural networks and concentration results for the samples collected in the joint batch and in the product batch using the isolated k-NN. Informally, Hornik’s functional approximation theorem [37] guarantees that feedforward neural networks are capable of fitting any continuous function. So depending on the true density of the data, there exists a choice of parameters $\tilde{\theta}$ that enables approximating the desired function with arbitrary accuracy. Next, we show that the empirical loss function $L_{\mathrm{emp}}(\omega_\theta)$ is concentrated around its mean $L(\omega_\theta)$ for any $\theta$. Combining these tools, we are able to minimize the empirical loss function as in (19) and we expect $\hat{\theta}$ to be close to $\tilde{\theta}$ asymptotically; thus, eventually $\hat{\Gamma}(x,y,z)$ properly approximates $\Gamma(x,y,z)$. Additionally, the empirical computation of the DV bound is concentrated around the expected value, which concludes the consistency of the end-to-end estimation of the CMI.
In this paper, we focus mainly on extending the concentration results provided in [11] (Proposition 1) to data under a Markov assumption. Although conventionally many asymptotic results for i.i.d. data are assumed to hold for Markov data as well, the required extensions here are more involved due to the additional complexity of the k-NN method. In the following, we first show the convergence of the empirical average for the joint batch,
$$|\mathcal{B}_{\mathrm{joint}}^{\alpha}|^{-1} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{joint}}^{\alpha}} g(X,Y,Z) \to \mathbb{E}_{\tilde{P}}[g(X,Y,Z)],$$
where $g(\cdot)$ is any measurable function such that the expectation exists and is finite. As the product batch collects samples corresponding to the $k$ nearest neighbors, convergence results for nearest neighbor regression are invoked to show that the empirical average for the product batch converges to the expectation with respect to the product distribution $\tilde{Q}$,
$$|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|^{-1} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} g(X,Y,Z) \to \mathbb{E}_{\tilde{Q}}[g(X,Y,Z)].$$
Then, we conclude the consistency of the overall estimation.

3.4.1. Convergence for the Joint Batch

One well-known extension of the law of large numbers to non-i.i.d. processes is Birkhoff’s ergodic theorem, which is the basis of our proof of the following proposition on the convergence of the sample average over the joint batch.
Proposition 1.
Consider the sequence of random variables $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ generated under Assumption 1, and the distribution $\tilde{P}$ in Definition 1. For any measurable function $g(\cdot)$ such that $\mathbb{E}_{\tilde{P}}[g(X,Y,Z)]$ exists and is finite,
$$\frac{1}{|\mathcal{B}_{\mathrm{joint}}^{\alpha}|} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{joint}}^{\alpha}} g(X,Y,Z) \xrightarrow{a.s.} \mathbb{E}_{\tilde{P}}[g(X,Y,Z)].$$
Proof. 
See Appendix A. ☐

3.4.2. Convergence for the Product Batch

From Definition 3, the empirical summation over all samples in the product batch is equivalent to averaging $|\mathcal{I}_{\alpha,s}|$ k-NN regressions. Considering a sequence of pairs $\{(U_i, V_i)\}_{i=1}^{n}$ generated from stationary ergodic processes $(\mathbf{U}, \mathbf{V})$, k-NN regression denotes the problem of estimating $m(u) := \mathbb{E}[V | U = u]$ with $m_n(u) := \frac{1}{k(n)} \sum_{j=1}^{k(n)} V_{r_j}$, where $r_j$ refers to the index of the $j$-th nearest neighbor of $u$ among $U_1, \ldots, U_n$. This problem has been well studied when the pairs $(U_i, V_i)$ are generated i.i.d. For example, in [38], the authors show the convergence of $m_n(u)$ as:
$$P\left(\int |m_n(u) - m(u)| \, p(u) \, du > \epsilon\right) \le \exp(-n a \epsilon^2),$$
for some positive constant $a$, when $k(n) \to \infty$ and $k(n)/n \to 0$. However, if the pairs are not independent, convergence results require a more advanced condition, called the geometric $\phi$-mixing condition or geometric ergodicity condition [39,40]. As argued in [39], geometric ergodicity is not a restrictive condition and holds for a wide range of processes (see also [41]). For instance, linear autoregressive processes are geometrically ergodic [41] (Ch. 15.5.2). Below we review the $\phi$-mixing condition.
Definition 4 ($\phi$-mixing condition).
A process $\mathbf{U}$ is $\phi$-mixing if for a sequence $\{\phi_n\}_{n \in \mathbb{N}}$ of positive numbers satisfying $\phi_n \to 0$ as $n \to \infty$, for any integer $i > 0$ we have:
$$|P(A \cap B) - P(A)\,P(B)| \le \phi_i \, P(A),$$
for all $n > 0$ and all sets $A$ and $B$ which are members of $\sigma(U_1, \ldots, U_n)$ and $\sigma(U_{n+i}, U_{n+i+1}, \ldots)$, respectively. If $\{\phi_n\}$ is a geometric sequence, $\mathbf{U}$ is called geometrically $\phi$-mixing.
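The k-NN regression estimate $m_n(u)$ introduced above can be tried on dependent data with a few lines of code. The sketch below uses an assumed linear AR(1) process for $U$ (a member of the autoregressive family mentioned above) and $V = U^2 + \text{noise}$, so that $m(u) = u^2$; the function name, the AR coefficient, the noise level, and the choice $k = \lfloor\sqrt{n}\rfloor$ are illustrative assumptions.

```python
import numpy as np

def knn_regress(u_query, u, v, k):
    """Average of v over the k samples whose u is nearest to u_query."""
    nearest = np.argsort(np.abs(u - u_query))[:k]
    return v[nearest].mean()

rng = np.random.default_rng(0)
n = 20_000
u = np.zeros(n)
for i in range(1, n):
    u[i] = 0.5 * u[i - 1] + rng.standard_normal()   # dependent (AR(1)) covariates
v = u**2 + 0.5 * rng.standard_normal(n)             # regression function m(u) = u^2

k = int(np.sqrt(n))                                 # k grows with n while k/n shrinks
estimate = knn_regress(1.0, u, v, k)
print(estimate)                                     # should be near m(1) = 1
```

Here $k(n) \to \infty$ averages out the noise while $k(n)/n \to 0$ keeps the neighbors local, which mirrors the two conditions stated for the i.i.d. result.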
To show the convergence of the empirical average over the product batch, we make the following assumptions.
Assumption 2.
The sequence $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ is geometrically $\phi$-mixing.
Assumption 3.
We assume that $\mathcal{Y}$ and $\mathcal{Z}$ are compact.
Proposition 2.
Let the sequence of random variables $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ be generated under Assumptions 1–3, and choose $k(n)$ and $s(n)$ such that:
$$s(n)\,k(n) = n, \qquad k(n) \to \infty, \qquad s(n) \to \infty, \qquad k(n)/(\log n)^2 \to \infty. \tag{25}$$
Consider $\tilde{Q}$ defined in Definition 1. Then, for any function $g(\cdot)$ such that $\mathbb{E}_{\tilde{Q}}[g(X,Y,Z)]$ exists and is finite, and additionally,
$$|g(x,y_1,z) - g(x,y_2,z)| \le L_g \, |y_1 - y_2| \quad \forall x \in \mathcal{X},\ z \in \mathcal{Z},\ y_1, y_2 \in \mathcal{Y},$$
where $L_g > 0$ is the Lipschitz constant, we have that:
$$\frac{1}{|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|} \sum_{(X,Y,Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} g(X,Y,Z) \xrightarrow{a.s.} \mathbb{E}_{\tilde{Q}}[g(X,Y,Z)].$$
Proof. 
See Appendix B. ☐
Remark 3.
Examples of choices satisfying (25) are, for instance, $k(n) = n^{1/2}$ and $k(n) = (\log n)^{2+\epsilon}$ for some $\epsilon > 0$, with $s(n)$ then determined by (25). Please note that in [11], consistency is shown when $k(n) = \Theta(n^{1/2})$. However, the convergence result in [11] (Theorem 1) is an explicit bound, so the condition on $k(n)$ can be relaxed (choosing a smaller $k(n)$) when we are only interested in the asymptotic behavior.

3.4.3. Convergence of the Overall Estimation

To complete our analysis on the consistency of the neural estimator, it is required to show that the loss function is properly approximated and it converges to the optimal loss as n increase. The following assumptions on the neural network and the densities enable us to show this convergence.
Assumption 4.
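For a network $\omega_\theta$ parameterized with $\theta \in \Theta$, the assumption holds if $\Theta$ is closed, $\Theta \subseteq \{\theta \mid \|\theta\|_2 \le K\}$ for some constant $K > 0$, and $\omega_\theta$ is $B$-Lipschitz with respect to $\theta$, for some constant $B > 0$, for all $(x, y, z)$, i.e.,
$$|\omega_{\theta_1}(x, y, z) - \omega_{\theta_2}(x, y, z)| \le B\,\|\theta_1 - \theta_2\|_2, \qquad \forall \theta_1, \theta_2 \in \Theta,\ (x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}.$$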
Assumption 5.
There exist $0 < p_{\min} < p_{\max} < \infty$ such that for all $(x, y, z) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$, the values of $p(x, y, z)$ and $p(x|z)\,p(y, z)$ are both in the interval $[p_{\min}, p_{\max}]$, and it holds that
$$\frac{p_{\min}}{p_{\max} + p_{\min}} \ge \tau,$$
to guarantee that $\tau \le \omega^* \le 1 - \tau$.
The following theorem concludes the consistency of the end-to-end estimator.
Theorem 1.
Let Assumptions 1–5 hold and let $k(n)$ and $s(n)$ satisfy (25). Then the CMI estimator $\hat{I}_{DV}^{n}(X; Y | Z)$ (defined in (21)) converges strongly to $I(X; Y | Z)$, i.e.,
$$\hat{I}_{DV}^{n}(X; Y | Z) \xrightarrow{\text{a.s.}} I(X; Y | Z).$$
Proof. 
See Appendix D. ☐
In the next section, we apply our estimator in several synthetic scenarios to verify its capability in estimating CMI and DI.

4. Simulation Results

In this section, we experiment with our proposed estimator of CMI and DI in the following auto-regressive model, which is widely used in different applications, including wireless communications [42], defining causal notions in econometrics [43], and modeling traffic flow [44], among others:
$$\begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix} = A \begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix} + B \begin{bmatrix} X_{i-1} \\ Y_{i-1} \\ Z_{i-1} \end{bmatrix} + \begin{bmatrix} N_i^x \\ N_i^y \\ N_i^z \end{bmatrix}, \tag{30}$$
where $A$ and $B$ are $3 \times 3$ matrices and the remaining variables are $d$-dimensional row vectors. $A$ models the instantaneous effect of $X_i$, $Y_i$, and $Z_i$ on each other and its diagonal elements are zero, while $B$ models the effect of the previous time instance. $N_i^x$, $N_i^y$, and $N_i^z$ (denoted as noise in some contexts) are independent and generated i.i.d. according to zero-mean Gaussian distributions with covariance matrices $\sigma_x^2 I_d$, $\sigma_y^2 I_d$, and $\sigma_z^2 I_d$, respectively (i.e., the dimensions are $d$ and the components are uncorrelated). Please note that this model fulfills Assumptions 1 and 2 by setting appropriate initial random variables. Although Gaussian random variables do not take values in a compact set, so that Assumption 3 does not hold, we could use truncated Gaussian distributions. Such an adjustment does not significantly change the statistics of the generated dataset, since the probability of finding a value far away from the mean is negligible.
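To make the model concrete, the following is a minimal sketch of a sample generator for the scalar case $d = 1$. It is not the authors' released code. It assumes that $A$ is strictly upper triangular (true for the matrices used in this section), so that each sample can be obtained by back-substitution, and it starts from a zero state rather than from stationary initial variables; the function name `simulate_ar` is hypothetical.

```python
import random

def simulate_ar(n, A, B, sigmas, seed=0):
    """Generate n samples (x_i, y_i, z_i) of V_i = A V_i + B V_{i-1} + N_i for d = 1.

    Assumption of this sketch: A is strictly upper triangular, so V_i follows
    from back-substitution. The initial state is zero, so the first few
    samples are not drawn from the stationary distribution.
    """
    if any(A[r][c] != 0 for r in range(3) for c in range(r + 1)):
        raise ValueError("this sketch requires a strictly upper-triangular A")
    rng = random.Random(seed)
    prev = [0.0, 0.0, 0.0]
    out = []
    for _ in range(n):
        noise = [rng.gauss(0.0, s) for s in sigmas]  # sigmas are standard deviations
        drive = [sum(B[r][c] * prev[c] for c in range(3)) + noise[r] for r in range(3)]
        cur = [0.0, 0.0, 0.0]
        for r in (2, 1, 0):  # row r only depends on rows with a larger index
            cur[r] = drive[r] + sum(A[r][c] * cur[c] for c in range(r + 1, 3))
        out.append(tuple(cur))
        prev = cur
    return out

# Example: A has a single instantaneous link (Y to X), B has two delayed links.
A_example = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
B_example = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
samples = simulate_ar(1000, A_example, B_example, sigmas=(1.0, 2.0, 2.0))
```

Discarding a short burn-in prefix is a simple way to reduce the effect of the zero initial state.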
In the following section, we test the capability of our estimator in estimating both conditional mutual information (CMI) and directed information (DI). In both cases, n samples are generated from the model and the estimations are performed according to Algorithms 1 and 2. Then according to Algorithm 3, the joint and product batches are split randomly in half to construct train and evaluation sets. Then the parameters of the classifier are trained with the train set and the final estimation is computed with the evaluation set (Codes are available at https://github.com/smolavipour/Neural-Estimator-of-Information-non-i.i.d, accessed on 20 May 2021).
To verify the performance of our technique, we also compared it with the approach taken in [4,31], which is as follows. Conditional mutual information can be computed by subtracting two mutual information terms, i.e.,
$$I(X; Y | Z) = I(X; Y, Z) - I(X; Z). \tag{31}$$
So instead of estimating the CMI term directly, one can use a neural estimator such as the classifier based estimator in [4] or the MINE estimator [1], and estimate each MI term in (31) to estimate the CMI. In what follows, we refer to this technique as MI-diff since it computes the difference between two MI terms.

4.1. Estimating Conditional Mutual Information

In this scenario, we estimate $I(X_1; Y_1 | Z_1)$ when $A$ and $B$ are chosen to be:
$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}.$$
Then from (30), the CMI can be computed as below:
$$\begin{aligned} I(X_1; Y_1 | Z_1) &= h(X_1 | Z_1) - h(X_1 | Y_1, Z_1) \\ &= h(Y_0 + Y_1 + N_1^x \mid Z_1) - h(Y_0 + N_1^x \mid Y_1, Z_1) \\ &= h(Y_0 + Y_1 + N_1^x) - h(Y_0 + N_1^x) \\ &= \frac{d}{2} \log\left(1 + \frac{\sigma_y^2 + \sigma_z^2}{\sigma_x^2 + \sigma_y^2 + \sigma_z^2}\right). \end{aligned}$$
Each estimated value is an average of $T = 20$ estimations, where in each round the batches are re-selected while the dataset is kept fixed. This procedure is repeated for 10 Monte Carlo trials, and the data are re-generated for each trial. The hyper-parameters and settings of the experiment are provided in Table 1. In Figure 3, the CMI is estimated (as $\hat{I}_{DV}^{n,T}(X_1; Y_1 | Z_1)$ in Algorithm 3) with $n = 2 \times 10^4$ samples of dimension $d = 1$ when $\sigma_y = 2$, $\sigma_z = 2$, and $\sigma_x$ is varied. It can be observed that the estimator properly estimates the CMI while the variance of the estimation is also small. The latter can be inferred from the shaded region, which indicates the range of the estimated CMI for a particular $\sigma_x$ over all Monte Carlo trials. Next, the experiment is repeated for $d = 10$ and the results are depicted in Figure 4, where we compare our estimation of CMI with the MI-diff approach, which is explained in (31) and in which each MI term is estimated with the classifier-based estimator proposed in [4]. It can be observed that the means of both estimators are similar; nonetheless, estimating the CMI directly is more accurate and has less variation than the MI-diff approach. Additionally, our method is faster since it computes the information term only once, while in the MI-diff approach two different classifiers are trained to estimate the two MI terms.
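The ground-truth curve against which the estimates are compared follows from the closed-form expression derived above. The sketch below evaluates it in two equivalent ways; the function names are hypothetical, and the natural logarithm (values in nats) is an assumption, since the base of the logarithm is not stated in this section.

```python
import math

def true_cmi(sigma_x, sigma_y, sigma_z, d=1):
    """Closed-form I(X_1; Y_1 | Z_1) for the model of Section 4.1, in nats."""
    s = sigma_y ** 2 + sigma_z ** 2          # variance of each Y_i component
    return 0.5 * d * math.log(1.0 + s / (sigma_x ** 2 + s))

def cmi_from_variances(sigma_x, sigma_y, sigma_z, d=1):
    """Same quantity as a difference of Gaussian differential entropies."""
    s = sigma_y ** 2 + sigma_z ** 2
    var_num = sigma_x ** 2 + 2.0 * s         # Var(Y_0 + Y_1 + N_1^x)
    var_den = sigma_x ** 2 + s               # Var(Y_0 + N_1^x)
    # h(G) = 0.5 * log(2 * pi * e * var) per dimension; the constants cancel.
    return 0.5 * d * (math.log(var_num) - math.log(var_den))

for sigma_x in (0.5, 1.0, 2.0, 4.0):
    print(sigma_x, true_cmi(sigma_x, 2.0, 2.0), cmi_from_variances(sigma_x, 2.0, 2.0))
```

As $\sigma_x$ grows the value decreases toward zero, and as $\sigma_x \to 0$ it approaches $\frac{d}{2}\log 2$.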

4.2. Estimating Directed Information

DI can explain the underlying causal relationship among processes. This notion has wide applications in various areas. For example, consider a social network where the activities of users are monitored (e.g., the message times as studied in [23]). The DI between these time-series data expresses how the activity of one user can affect the activity of the others. In addition to such data-analytic applications, DI characterizes the capacity of communication channels with feedback, and by estimating the capacity, rates and powers of transmission can be adjusted in radio communications (see for example [32]). In this experiment, consider a network of three processes $X$, $Y$, and $Z$, such that the time-series data are modeled with (30) with $d = 1$, where
$$A = 0, \qquad B = \begin{bmatrix} 0 & 0 & 0 \\ b_1 & 0 & 0 \\ 0 & b_2 & 0 \end{bmatrix}. \tag{33}$$
In this model, where the relations are depicted in Figure 5, the process $X$ affects $Y$ with a delay, and similarly the signal of $Y$ appears on $Z$ at the next time instance, while independent noise is added at both steps. The DIR for the link $X \to Y$ in this network can be computed as follows:
$$\begin{aligned} I(X \to Y) &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}) \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y_i \mid Y^{i-1}) - H(Y_i \mid X^i, Y^{i-1}) \right] \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Y_i) - H(Y_i \mid X_{i-1}) \right] \\ &= \frac{1}{2} \log\left(1 + \frac{b_1^2 \sigma_x^2}{\sigma_y^2}\right). \end{aligned} \tag{34}$$
Similarly, for the link $Y \to Z$, we have:
$$\begin{aligned} I(Y \to Z) &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(Y^i; Z_i \mid Z^{i-1}) \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i \mid Z^{i-1}) - H(Z_i \mid Y^i, Z^{i-1}) \right] \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i) - H(Z_i \mid Y_{i-1}) \right] \\ &= \frac{1}{2} \log\left(1 + \frac{b_1^2 b_2^2 \sigma_x^2 + b_2^2 \sigma_y^2}{\sigma_z^2}\right). \end{aligned} \tag{35}$$
Next we can compute the true DIR for the link $X \to Z$ as:
$$\begin{aligned} I(X \to Z) &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(X^i; Z_i \mid Z^{i-1}) \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i \mid Z^{i-1}) - H(Z_i \mid X^i, Z^{i-1}) \right] \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i) - H(Z_i \mid X_{i-2}) \right] \\ &= \frac{1}{2} \log\left(1 + \frac{b_1^2 b_2^2 \sigma_x^2}{b_2^2 \sigma_y^2 + \sigma_z^2}\right). \end{aligned} \tag{36}$$
Please note that the DIR corresponding to other links (i.e., the above links in the reverse direction) is zero by similar computations. Suppose we represent the causal relationships with a directed graph, where a link between two nodes exists if the corresponding DIR is non-zero. Then according to (34)–(36), the causal relationships are described with the graph of Figure 6a.
To estimate the DIR, note that the processes are Markov and the maximum Markov order ($o_{\max}$) for the set of all processes is $o_{\max} = 2$ according to (30) and (33). Hence by (7), we can estimate the DIR with the CMI estimator. For instance, the DIR for the processes $(X, Y)$ can be obtained by:
$$\hat{I}_{DV}^{n}(X \to Y) := \hat{I}_{DV}^{n}(X^3; Y_3 \mid Y^2),$$
where the right-hand side is computed similarly to (21). We performed the experiment with $n = 2 \times 10^5$ samples of dimension $d = 1$ generated according to the model (30) and (33) with $b_1 = 1$, $b_2 = 2$, $\sigma_x = 3$, $\sigma_y = 2$, and $\sigma_z = 1$, while the settings of the neural network were chosen as in Table 1. The estimated values are stated in Table 2. It can be seen that the bias of the estimator is fairly small while the variance of the estimations is negligible. This is in line with the observations in [11] when estimating CMI for the i.i.d. case.
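For reference, the true DIR values implied by (34)–(36) for these parameter settings can be evaluated directly. This is a small sketch with hypothetical function names; it assumes natural logarithms (nats), which is not stated explicitly in this section.

```python
import math

def dir_x_to_y(b1, sx, sy):
    """True DIR for the link X -> Y, following (34)."""
    return 0.5 * math.log(1.0 + (b1 * sx) ** 2 / sy ** 2)

def dir_y_to_z(b1, b2, sx, sy, sz):
    """True DIR for the link Y -> Z, following (35)."""
    return 0.5 * math.log(1.0 + ((b1 * b2 * sx) ** 2 + (b2 * sy) ** 2) / sz ** 2)

def dir_x_to_z(b1, b2, sx, sy, sz):
    """True DIR for the link X -> Z, following (36)."""
    return 0.5 * math.log(1.0 + (b1 * b2 * sx) ** 2 / ((b2 * sy) ** 2 + sz ** 2))

# Parameter values used in the experiment of this section.
b1, b2, sx, sy, sz = 1.0, 2.0, 3.0, 2.0, 1.0
truth = {
    "X->Y": dir_x_to_y(b1, sx, sy),
    "Y->Z": dir_y_to_z(b1, b2, sx, sy, sz),
    "X->Z": dir_x_to_z(b1, b2, sx, sy, sz),
}
for link, value in truth.items():
    print(f"{link}: {value:.4f} nats")
```

For these parameters the value for $X \to Z$ is smaller than both single-hop values, as one would expect for a link that acts only through $Y$.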
Although $I(X \to Z) > 0$, intuitively $X$ affects $Z$ causally only through $Y$, which suggests that $I(X \to Z \,\|\, Y) = 0$. This phenomenon is referred to as the proxy effect in the study of directed information graphs (see [45]). In fact, the graphical representation of causal relationships can be simplified using the notion of causally conditioned DIR, as depicted in Figure 6b. To see this formally, note that from (30) it follows that:
$$\begin{aligned} I(X \to Z \,\|\, Y) &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(X^i; Z_i \mid Y^i, Z^{i-1}) \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i \mid Y^i, Z^{i-1}) - H(Z_i \mid X^i, Y^i, Z^{i-1}) \right] \\ &= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ H(Z_i \mid Y_{i-1}) - H(Z_i \mid Y_{i-1}) \right] \\ &= 0. \end{aligned}$$
Considering $o_{\max} = 2$, the causally conditioned DIR terms can be estimated with our CMI estimator according to (7); for instance,
$$\hat{I}_{DV}^{n}(X \to Y \,\|\, Z) := \hat{I}_{DV}^{n}(X^3; Y_3 \mid Y^2, Z^3).$$
The estimation results are provided in Table 3 for all the links, where for each link we averaged over T = 20 estimations (as in Algorithm 3); then the procedure is repeated for 10 Monte Carlo trials in which we generate a new dataset according to the model.
In this experiment, we did not explore the effect of higher dimensions for data, although one should note that for the causally conditioned DIR estimation, with $d = 1$ the neural network is fed with data of size 9. Nevertheless, the performance of this estimator in higher dimensions with i.i.d. data has been studied in [11], and the challenges of dealing with high dimensions when the data have dependency can be considered a future direction of this work. Additionally, although the information about $o_{\max}$ may not always be available in practice, it can be approximated by data-driven approaches similar to the method described in [45].
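The input size of 9 can be seen by writing out how the time series are turned into samples for the CMI estimator when the Markov order is 2. The helper below is an illustrative sketch only; its name, signature, and tuple layout are assumptions and are not taken from the released code.

```python
def di_windows(x, y, z=None, order=2):
    """Build (x_part, y_part, cond_part) samples for DIR estimation via CMI.

    Without z the samples correspond to I(X^3; Y_3 | Y^2) for order = 2; with z
    they correspond to the causally conditioned case I(X^3; Y_3 | Y^2, Z^3).
    Inputs are equal-length sequences of scalars (d = 1).
    """
    samples = []
    for i in range(order, len(x)):
        x_part = tuple(x[i - order:i + 1])     # X_{i-order}, ..., X_i
        y_part = (y[i],)                       # Y_i
        cond = tuple(y[i - order:i])           # Y_{i-order}, ..., Y_{i-1}
        if z is not None:
            cond += tuple(z[i - order:i + 1])  # Z_{i-order}, ..., Z_i
        samples.append((x_part, y_part, cond))
    return samples

x = list(range(10))
y = [10 + v for v in x]
z = [20 + v for v in x]
first = di_windows(x, y, z)[0]
print(first, sum(len(part) for part in first))
```

With `z` supplied, each sample concatenates 3 + 1 + 2 + 3 = 9 scalars, which matches the input size mentioned above; without `z` the size is 6.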

5. Conclusions and Future Directions

In this paper, we explored the potential of a neural estimator for information measures when there are time dependencies among the samples. We extended the analysis of the convergence of the estimation and provided experimental results to show the performance of the estimator in practice. Furthermore, we compared our estimation method with a similar approach taken in [4,31] (which we denoted as MI-diff), and demonstrations on synthetic scenarios show that the variances of our estimations are smaller. However, the main contribution is the derivation of proofs of convergence when the data are generated from a Markov source. Our estimator is based on a k-NN method to re-sample the dataset such that the empirical average over the samples converges to the expectation with respect to a certain density. The convergence result derived for the re-sampling technique is stand-alone and can be adopted in other sampling applications.
Our proposed estimator can potentially be used in the areas of information theory, communication systems, and machine learning. For instance, the capacity of channels with feedback can be characterized with directed information and estimated with our estimator, which can be investigated as a future direction. Furthermore, in machine learning applications where the data have some form of dependency (either spatial or temporal), regularizing the training with information flow requires the estimator of information to capture causality, which is considered in our technique. Finally, information measures can be used in modeling and controlling a complex system, and the results in this work can provide meaningful measures such as conditional dependence and causal influence.

Author Contributions

Conceptualization, S.M.; methodology, S.M., H.G., and G.B.; software, S.M.; validation, S.M., H.G., G.B. and M.S.; formal analysis, S.M., H.G., and G.B.; investigation, S.M. and G.B.; resources, M.S.; data curation, S.M.; writing—original draft preparation, S.M. and H.G.; writing—review and editing, S.M., H.G., G.B. and M.S.; visualization, S.M.; supervision, G.B. and M.S.; project administration, M.S.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research, and the Swedish Research Council under contract 2019-03606.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PDF: Probability density function
IID: Independent and identically distributed
MI: Mutual information
CMI: Conditional mutual information
DI: Directed information
DIR: Directed information rate
TE: Transfer entropy
DV: Donsker-Varadhan
NWJ: Nguyen-Wainwright-Jordan
k-NN: k nearest neighbors
ML: Machine learning
RNN: Recurrent neural network

Appendix A. Proof of Proposition 1

To show the convergence stated in the Proposition, let us first introduce the following lemma, which is a variant of Birkhoff’s ergodic theorem for the case where the samples are not necessarily consecutive.
Lemma A1.
Let $U^n$ be $n$ observations of a stationary and ergodic Markov process where $U_i \in \mathcal{U}$ and $\mathcal{U} \subseteq \mathbb{R}^d$. Then if $\mathbb{E}[g(U)]$ exists and is finite,
$$\frac{1}{|\mathcal{I}_{\alpha,n}|} \sum_{j \in \mathcal{I}_{\alpha,n}} g(U_j) \xrightarrow{\text{a.s.}} \mathbb{E}[g(U)], \tag{A1}$$
where $\mathcal{I}_{\alpha,n}$ is defined in Definition 2 and the empirical average is considered to be zero when $|\mathcal{I}_{\alpha,n}| = 0$.
Proof.
Consider $W_1, \ldots, W_n$ generated i.i.d. with $W_i \sim \mathrm{Bernoulli}(\alpha)$. From the definition of $\mathcal{I}_{\alpha,n}$, we can write the summation equivalently as
$$\sum_{j \in \mathcal{I}_{\alpha,n}} g(U_j) = \sum_{i=1}^{n} W_i\, g(U_i). \tag{A2}$$
Since the $W_i$’s are independent of $g(U_i)$, the pairs $(W_i, g(U_i))$ are also stationary and ergodic Markov, so from Birkhoff’s ergodic theorem,
$$\frac{1}{n} \sum_{i=1}^{n} W_i\, g(U_i) - \mathbb{E}[W g(U)] \xrightarrow{\text{a.s.}} 0, \tag{A3}$$
and since $\mathbb{E}[W g(U)] = \mathbb{E}[W]\,\mathbb{E}[g(U)] = \alpha\,\mathbb{E}[g(U)]$,
$$\frac{1}{n} \sum_{i=1}^{n} W_i\, g(U_i) \xrightarrow{\text{a.s.}} \alpha\,\mathbb{E}[g(U)]. \tag{A4}$$
On the other hand, from the strong law of large numbers:
$$\frac{|\mathcal{I}_{\alpha,n}|}{n} = \frac{1}{n} \sum_{i=1}^{n} W_i \xrightarrow{\text{a.s.}} \alpha. \tag{A5}$$
From (A4) and (A5), and since the summation in (A5) is bounded,
$$\frac{1}{|\mathcal{I}_{\alpha,n}|} \sum_{j \in \mathcal{I}_{\alpha,n}} g(U_j) \xrightarrow{\text{a.s.}} \mathbb{E}[g(U)], \tag{A6}$$
and the proof is complete. ☐
Using Lemma A1, the proof of Proposition 1 becomes trivial by letting $U_i = (X_i, Y_i, Z_i)$, since the triple is a sample of a jointly stationary ergodic Markov process. Noting that $|\mathcal{I}_{\alpha,n}| = |\mathcal{B}_{\mathrm{joint}}^{\alpha}|$ concludes the proof of the Proposition.
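The statement of Lemma A1 can be checked numerically on a simple dependent source. The sketch below is an illustration under assumptions that are not part of the paper: a scalar Gaussian AR(1) process $U_i = a\,U_{i-1} + N_i$ with unit-variance noise, the test function $g(u) = u^2$, and the hypothetical function name `subsampled_average`. For this process $\mathbb{E}[g(U)] = 1/(1 - a^2)$.

```python
import random

def subsampled_average(n, alpha, a=0.5, seed=0):
    """Average g(U_j) = U_j^2 over indices kept by i.i.d. Bernoulli(alpha) draws."""
    rng = random.Random(seed)
    # Draw the initial value from the stationary law N(0, 1 / (1 - a^2)).
    u = rng.gauss(0.0, (1.0 / (1.0 - a * a)) ** 0.5)
    total, count = 0.0, 0
    for _ in range(n):
        u = a * u + rng.gauss(0.0, 1.0)   # stationary, ergodic Markov source
        if rng.random() < alpha:          # W_i ~ Bernoulli(alpha), independent of U
            total += u * u
            count += 1
    return total / count if count else 0.0   # zero when no index is selected

expected = 1.0 / (1.0 - 0.5 ** 2)
for n in (10**3, 10**4, 10**5):
    print(n, subsampled_average(n, alpha=0.3), expected)
```

The subsampled average approaches the same limit as the full ergodic average, since the selection variables are independent of the source.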

Appendix B. Proof of Proposition 2

To show the convergence of the empirical average over samples in the product batch, we begin by reviewing convergence results for k-NN regression.
Lemma A2
([39] (Theorem 2-a)). Suppose the sequence $\{(U_i, V_i)\}_{i=1}^{n}$ is stationary and geometrically ϕ-mixing (see Definition 4). If $k(n)/n \to 0$ and $k(n)/(\log n)^2 \to \infty$, then
$$\sup_{u} |m_n(u) - m(u)| \xrightarrow{\text{a.s.}} 0.$$
Now to extend Lemma A2 to the case where the samples are randomly selected for the regression, we show the following lemmas.
Lemma A3.
Let $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ be generated under Assumptions 1–3. If $k(n)/n \to 0$ and $k(n)/(\log n)^2 \to \infty$, and for any $y \in \mathcal{Y}$, $\mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z]$ exists and is finite, then we have that, for all $y$:
$$\sup_{z} \left| \tilde{g}(y, z) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| \xrightarrow{\text{a.s.}} 0, \tag{A7}$$
where
$$\tilde{g}(y, z) := \frac{1}{k(n)} \sum_{j=1}^{k(n)} g(X_{r_j}, y, z),$$
and $r_j$ refers to the index of the $j$-th nearest neighbor of $z$ among $\{Z_i\}_{i=1}^{n}$.
Proof.
The proof follows directly from Lemma A2 as $y$ is fixed in (A7). ☐
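The k-NN average $\tilde{g}(y, z)$ of Lemma A3 can be sketched in a few lines. The example below is illustrative only: the data model ($Z$ following an AR(1) recursion and $X = Z$ plus independent noise), the test function $g(x, y, z) = x\,y$, and the function name `knn_conditional_mean` are all assumptions chosen so that the target $\mathbb{E}[g(X, y, Z) \mid Z = z] = y\,z$ is known.

```python
import random

def knn_conditional_mean(g, xs, zs, y, z, k):
    """Average g(X_{r_j}, y, z) over the k nearest neighbors of z among zs."""
    order = sorted(range(len(zs)), key=lambda i: abs(zs[i] - z))
    return sum(g(xs[i], y, z) for i in order[:k]) / k

rng = random.Random(0)
n = 20000
zs, xs, prev = [], [], 0.0
for _ in range(n):
    prev = 0.5 * prev + rng.gauss(0.0, 1.0)   # dependent conditioning variable
    zs.append(prev)
    xs.append(prev + rng.gauss(0.0, 0.5))     # X = Z + noise, so E[X | Z = z] = z

k = int(n ** 0.5)                              # k(n) = n^(1/2), as in Remark 3
estimate = knn_conditional_mean(lambda x, y, z: x * y, xs, zs, y=2.0, z=0.5, k=k)
print(estimate)   # should be close to y * z = 1.0
```

Sorting all indices costs $O(n \log n)$ per query; a spatial index would be preferable for large datasets, but the sort keeps the sketch short.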
Lemma A4.
Let $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ be generated under Assumptions 1–3. Then, if $k(n)$ and $s(n)$ fulfill the assumptions in (25), and for any $y \in \mathcal{Y}$, $\mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z]$ exists and is finite, for all $y$:
$$\sup_{z} \left| \bar{g}(y, z, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| \xrightarrow{\text{a.s.}} 0, \tag{A8}$$
where
$$\bar{g}(y, z, W^{s(n)}) := \frac{1}{k(n)} \sum_{l \in \mathcal{A}_{\alpha,k(n),n,s(n)}(z, Z^n, W^{s(n)})} g(X_l, y, z), \tag{A9}$$
and $\mathcal{A}_{\alpha,k(n),n,s(n)}(z, Z^n, W^{s(n)})$ and $W^{s(n)}$ are defined in Definition 3.
Proof.
See Appendix C. ☐
Lemma A5.
For the sequence $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ defined in Lemma A4:
$$\left| \bar{g}(Y_{s(n)}, Z_{s(n)}, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_{s(n)}, Z = Z_{s(n)}] \right| \xrightarrow{\text{a.s.}} 0, \tag{A10}$$
where $s(n) < n$ and the convergence occurs according to the random variables $Y_{s(n)}$, $Z_{s(n)}$, $W^{s(n)}$, and the sequence.
Proof. 
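To simplify the notation, we use $\bar{g}(y, z)$ instead of $\bar{g}(y, z, W^{s(n)})$ in this proof. Since $\mathcal{Y}$ is compact, for any $\epsilon > 0$, there exist finitely many ($M$) balls with radius $\epsilon / L_g$ and centers $\tilde{y}_j$ for $j = 1, \ldots, M$, that cover $\mathcal{Y}$. Then, from the triangle inequality, we have:
$$P\left( \lim_{n \to \infty} \sup_{y,z} \left| \bar{g}(y, z) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| \le 2\epsilon \right) \ge P\left( \lim_{n \to \infty} \sup_{y,z} \left( |\Delta^{(1)}(y, z)| + |\Delta^{(2)}(y, z)| + |\Delta^{(3)}(y, z)| \right) \le 2\epsilon \right), \tag{A11}$$
where
\begin{align}
\Delta^{(1)}(y, z) &:= \bar{g}(y, z) - \bar{g}(\tilde{y}_j, z), \tag{A12} \\
\Delta^{(2)}(y, z) &:= \bar{g}(\tilde{y}_j, z) - \mathbb{E}_{P_{X|Z}}[g(X, \tilde{y}_j, Z) \mid Z = z], \tag{A13} \\
\Delta^{(3)}(y, z) &:= \mathbb{E}_{P_{X|Z}}[g(X, \tilde{y}_j, Z) \mid Z = z] - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z], \tag{A14}
\end{align}
and $\tilde{y}_j$ is the center of the ball containing $y$. Note that
\begin{align}
&\lim_{n \to \infty} \sup_{y,z} \left( |\Delta^{(1)}(y, z)| + |\Delta^{(2)}(y, z)| + |\Delta^{(3)}(y, z)| \right) \notag \\
&\quad \le \lim_{n \to \infty} \sup_{y,z} |\Delta^{(1)}(y, z)| + \lim_{n \to \infty} \sup_{y,z} |\Delta^{(2)}(y, z)| + \lim_{n \to \infty} \sup_{y,z} |\Delta^{(3)}(y, z)| \tag{A15} \\
&\quad \le 2\epsilon + \lim_{n \to \infty} \sup_{y,z} |\Delta^{(2)}(y, z)|, \tag{A16}
\end{align}
where (A16) follows from (26) and the radius of the balls being $\epsilon / L_g$. Thus (A11) yields:
\begin{align}
P\left( \lim_{n \to \infty} \sup_{y,z} \left| \bar{g}(y, z) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| \le 2\epsilon \right) &\ge P\left( \lim_{n \to \infty} \sup_{y,z} |\Delta^{(2)}(y, z)| \le 0 \right) \notag \\
&\ge P\left( \lim_{n \to \infty} \max_{\tilde{y}_j} \sup_{z} |\Delta^{(2)}(\tilde{y}_j, z)| \le 0 \right) \tag{A17} \\
&= P\left( \max_{\tilde{y}_j} \lim_{n \to \infty} \sup_{z} |\Delta^{(2)}(\tilde{y}_j, z)| \le 0 \right) \tag{A18} \\
&\ge 1 - \sum_{j=1}^{M} P\left( \lim_{n \to \infty} \sup_{z} |\Delta^{(2)}(\tilde{y}_j, z)| > 0 \right) \tag{A19} \\
&= 1, \tag{A20}
\end{align}
where (A17) holds by the definition (A13), (A18) follows since $\tilde{y}_j$ is independent of $n$, and the last step is due to Lemma A4. Finally, since (A20) holds for any $\epsilon > 0$, according to [46] (Prop 1.13) it is concluded that:
$$P\left( \lim_{n \to \infty} \sup_{y,z} \left| \bar{g}(y, z) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| = 0 \right) = 1. \tag{A21}$$
Consider now the probability space $(\Omega, \mathcal{F}, P)$. For any $y \in \mathcal{Y}$ and $z \in \mathcal{Z}$, $\bar{g}(y, z)$ can be expressed equivalently as $\bar{g}(y, z; \psi) : \Omega \to \mathbb{R}$. Consider the functions $Y_{s(n)}(\psi) : \Omega \to \mathcal{Y}$ and $Z_{s(n)}(\psi) : \Omega \to \mathcal{Z}$, then from (A21):
$$\begin{aligned} &P\left( \psi \in \Omega : \lim_{n \to \infty} \left| \bar{g}(Y_{s(n)}(\psi), Z_{s(n)}(\psi); \psi) - \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_{s(n)}(\psi), Z = Z_{s(n)}(\psi)] \right| = 0 \right) \\ &\quad \ge P\left( \psi \in \Omega : \lim_{n \to \infty} \sup_{y,z} \left| \bar{g}(y, z; \psi) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| = 0 \right) = 1, \end{aligned} \tag{A22}$$
which implies that:
$$\left| \bar{g}(Y_{s(n)}, Z_{s(n)}, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_{s(n)}, Z = Z_{s(n)}] \right| \xrightarrow{\text{a.s.}} 0, \tag{A23}$$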
and the proof of Lemma A5 is concluded. ☐
Now that the required tools have been introduced, we can continue the proof of Proposition 2. From Definition 3 and (A9), the LHS of (27) can be expressed as below:
$$\frac{1}{k(n)\,|\mathcal{I}_{\alpha,s(n)}|} \sum_{(X, Y, Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} g(X, Y, Z) = \frac{1}{|\mathcal{I}_{\alpha,s(n)}|} \sum_{i=1}^{s(n)} W_i\, \bar{g}(Y_i, Z_i, W^{s(n)}). \tag{A24}$$
Let us define:
$$\Delta_i := \bar{g}(Y_i, Z_i, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_i, Z = Z_i],$$
and from Lemma A5, we obtain that:
$$|\Delta_{s(n)}| \xrightarrow{\text{a.s.}} 0. \tag{A25}$$
As a result we can show the following strong convergence:
\begin{align}
P\left( \lim_{n \to \infty} \frac{1}{s(n)} \sum_{i=1}^{s(n)} W_i \Delta_i = 0 \right) &\ge P\left( \lim_{n \to \infty} W_{s(n)} \Delta_{s(n)} = 0 \right) \tag{A26} \\
&\ge P\left( \lim_{n \to \infty} |\Delta_{s(n)}| = 0 \right) \tag{A27} \\
&= 1, \tag{A28}
\end{align}
where (A26) holds since $s(n) \to \infty$ by (25) and using the Cesàro mean ([47] (Theorem 4.2.3)), (A27) holds since $W_{s(n)} \in \{0, 1\}$, and the equality in the last step follows from (A25). In other words,
$$\frac{1}{s(n)} \sum_{i=1}^{s(n)} W_i\, \bar{g}(Y_i, Z_i, W^{s(n)}) - \frac{1}{s(n)} \sum_{i=1}^{s(n)} W_i\, \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_i, Z = Z_i] \xrightarrow{\text{a.s.}} 0. \tag{A29}$$
Next, since the sequence $\{(W_i, Y_i, Z_i)\}_{i=1}^{s(n)}$ is stationary and ergodic, using Birkhoff’s ergodic theorem we have:
$$\frac{1}{s(n)} \sum_{i=1}^{s(n)} W_i\, \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y = Y_i, Z = Z_i] \xrightarrow{\text{a.s.}} \mathbb{E}_{P_W P_{YZ}}\left[ W\, \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y, Z] \right]. \tag{A30}$$
As $W$ is generated independently,
$$\mathbb{E}_{P_W P_{YZ}}\left[ W\, \mathbb{E}_{P_{X|Z}}[g(X, Y, Z) \mid Y, Z] \right] = \mathbb{E}[W]\, \mathbb{E}_{\tilde{Q}}[g(X, Y, Z)]. \tag{A31}$$
To complete the proof, note that
$$\frac{|\mathcal{I}_{\alpha,s(n)}|}{s(n)} \xrightarrow{\text{a.s.}} \mathbb{E}[W]. \tag{A32}$$
Therefore, from (A24) and (A29)–(A32), and $|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}| = k(n)\,|\mathcal{I}_{\alpha,s(n)}|$, we conclude that:
$$\frac{1}{|\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|} \sum_{(X, Y, Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} g(X, Y, Z) \xrightarrow{\text{a.s.}} \mathbb{E}_{\tilde{Q}}[g(X, Y, Z)], \tag{A33}$$
and the proof is complete. ☐

Appendix C. Proof of Lemma A4

According to Definition 3, the index set $\mathcal{I}_{\alpha,s(n)}$ is determined by the sequence $W^{s(n)}$. Therefore, $\mathcal{A}_{\alpha,k(n),n,s(n)}(z, Z^n, W^{s(n)})$ denotes the set of indices of the $k(n)$ nearest neighbors of $z$ among $\{Z_i \mid i \in \mathcal{I}_{\alpha,s(n)}^{c}\}$, unlike in Lemma A3 where the neighbors can be chosen among the whole sequence $\{Z_i\}_{i=1}^{n}$. Hence, the first step is to verify the ϕ-mixing condition for the isolated k-NN method where some indices are excluded. Intuitively, if $\{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ is ϕ-mixing, then the sequence $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{I}_{\alpha,s(n)}^{c}}$ is also ϕ-mixing, since the random jumps make the asymptotic independence (see Definition 4) happen at a faster rate. Nonetheless, we can show that the sequence $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{I}_{\alpha,s(n)}^{c}}$ satisfies the mixing condition for Lemma A3, as expressed in the following.
The basis of the proof for Lemma A2, and thus Lemma A3, is Collomb’s inequality [48] (Theorem 2.2.1), which provides a concentration bound similar to Hoeffding’s inequality for ϕ-mixing variables. For instance, if $U$ is a ϕ-mixing process where $\mathbb{E}[U_i] = 0$, $|U_i| \le a_1$, $\mathbb{E}[U_i^2] \le a_2$, and $\mathbb{E}[|U_i|] \le a_3$, the inequality states that:
$$P\left( \left| \sum_{i=1}^{n} U_i \right| > \epsilon \right) \le \exp\left( \frac{3 \sqrt{e}\, n\, \phi_t}{t} - a_4 \epsilon + 6 a_4^2\, n \left( a_2 + 4 a_1 a_3 \sum_{i=1}^{t} \phi_i \right) \right), \tag{A34}$$
for some integer $t < n$ and real $a_4$ such that $a_1 a_4 t \le 1/4$. In order to show a similar inequality for $\{U_i\}_{i \in \mathcal{I}_{\alpha,s(n)}^{c}}$, we have that:
$$P\left( \left| \sum_{i \in \mathcal{I}_{\alpha,s(n)}^{c}} U_i \right| > \epsilon \right) = P\left( \left| \sum_{i=1}^{n} U_i - \sum_{i=1}^{s(n)} W_i U_i \right| > \epsilon \right) \le P\left( \left| \sum_{i=1}^{n} U_i \right| > \epsilon / 2 \right) + P\left( \left| \sum_{i=1}^{s(n)} W_i U_i \right| > \epsilon / 2 \right), \tag{A35}$$
where both terms in (A35) are bounded with exponential terms and can be dominated by either of them. Thus, as $n \to \infty$ and $s(n) \to \infty$ (by assumption (25)), both terms tend to zero and Collomb’s inequality applies to the summation over the sub-sequence of samples remaining after the isolation. In other words, the required mixing condition holds for the new sequence $\{(X_i, Y_i, Z_i)\}_{i \in \mathcal{I}_{\alpha,s(n)}^{c}}$ and the result in Lemma A2 can be extended to this Lemma.
Next it remains to verify the conditions of Lemma A2 on $k(n)$. From (25) we have,
$$\frac{k(n)}{|\mathcal{I}_{\alpha,s(n)}^{c}|} \le \frac{k(n)}{n - s(n)} = \frac{1}{s(n)\left(1 - \frac{1}{k(n)}\right)} \xrightarrow{\text{a.s.}} 0,$$
which yields that:
$$\frac{k(n)}{\left( \log |\mathcal{I}_{\alpha,s(n)}^{c}| \right)^2} \ge \frac{k(n)}{(\log n)^2} \xrightarrow{\text{a.s.}} \infty.$$
Therefore, the conditions of Lemma A2 hold and from Lemma A3 it follows that for all $y \in \mathcal{Y}$:
$$\sup_{z} \left| \bar{g}(y, z, W^{s(n)}) - \mathbb{E}_{P_{X|Z}}[g(X, y, Z) \mid Z = z] \right| \xrightarrow{\text{a.s.}} 0,$$
which concludes the proof of the Lemma. ☐

Appendix D. Proof of Theorem 1

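Based on the universal functional approximation theory of neural networks [37], ref. [4] (Lemma 4) implies that for any $\epsilon_0 > 0$, there exists $\tilde{\theta} \in \Theta$ such that:
$$|L(\omega_{\tilde{\theta}}) - L^*| < \frac{\epsilon_0}{2}, \tag{A38}$$
where $L^* := L(\omega^*)$, and $L(\omega)$ and $\omega^*$ were defined in (15). Moreover, from Propositions 1 and 2, for any $\theta \in \Theta$, the empirical loss $L_{\mathrm{emp}}(\omega_\theta)$ defined in (18) converges asymptotically to the expected loss $L(\omega_\theta)$. This is obtained by letting $g(x, y, z) = \log(\omega_\theta(x, y, z))$ and $g(x, y, z) = \log(1 - \omega_\theta(x, y, z))$ in Propositions 1 and 2, respectively, and noting Remark 2. Thus we have:
$$L_{\mathrm{emp}}(\omega_\theta) \xrightarrow{\text{a.s.}} L(\omega_\theta). \tag{A39}$$
Since $\Theta \subseteq \mathbb{R}^h$ and $\|\theta\|_2 \le K$, $\forall \theta \in \Theta$, $\Theta$ can be covered with a finite number $N(\Theta, r)$ of balls of radius $r$, where $N(\Theta, r)$ is bounded [49]:
$$N(\Theta, r) \le \left( \frac{2 K \sqrt{h}}{r} \right)^{h}.$$
Let $\{\theta_1, \ldots, \theta_{N(\Theta, r)}\}$ denote the centers of the covering balls. Let $j_n$ be the index of the ball that $\hat{\theta}$ belongs to, then from the triangle inequality we have:
$$\begin{aligned} |L_{\mathrm{emp}}(\omega_{\hat{\theta}}) - L(\omega_{\hat{\theta}})| &\le |L_{\mathrm{emp}}(\omega_{\hat{\theta}}) - L_{\mathrm{emp}}(\omega_{\theta_{j_n}})| + |L_{\mathrm{emp}}(\omega_{\theta_{j_n}}) - L(\omega_{\theta_{j_n}})| + |L(\omega_{\theta_{j_n}}) - L(\omega_{\hat{\theta}})| \\ &\le |L_{\mathrm{emp}}(\omega_{\theta_{j_n}}) - L(\omega_{\theta_{j_n}})| + \frac{2 B r}{\tau}, \end{aligned} \tag{A41}$$
where the second inequality holds due to the Lipschitz continuity of $\omega_\theta$ stated in Assumption 4. From the union bound and for any $\epsilon > 0$, we have:
\begin{align}
P\left( \lim_{n \to \infty} |L_{\mathrm{emp}}(\omega_{\hat{\theta}}) - L(\omega_{\hat{\theta}})| > \frac{\epsilon}{2} \right) &\le N(\Theta, r)\, P\left( \lim_{n \to \infty} |L_{\mathrm{emp}}(\omega_{\theta_{j_n}}) - L(\omega_{\theta_{j_n}})| > \frac{\epsilon}{2} - \frac{2 B r}{\tau} \right) \tag{A42} \\
&= 0, \tag{A43}
\end{align}
where (A42) holds due to (A41), applying a union bound over all centers $\theta_j$, and choosing $r < \frac{\epsilon \tau}{4 B}$, and the last step follows by exploiting the strong convergence in (A39). As a result, with probability one:
\begin{align}
\lim_{n \to \infty} L(\omega_{\hat{\theta}}) &\le \lim_{n \to \infty} L_{\mathrm{emp}}(\omega_{\hat{\theta}}) + \frac{\epsilon}{2} \tag{A44} \\
&\le \lim_{n \to \infty} L_{\mathrm{emp}}(\omega_{\tilde{\theta}}) + \frac{\epsilon}{2} \tag{A45} \\
&= L(\omega_{\tilde{\theta}}) + \frac{\epsilon}{2} \tag{A46} \\
&\le L^* + \epsilon, \tag{A47}
\end{align}
where (A44) is obtained from (A43), (A45) holds since $\hat{\theta}$ minimizes $L_{\mathrm{emp}}(\omega_\theta)$, and (A46) follows from (A39). Finally, the last step is derived using (A38) and choosing $\epsilon_0 = \epsilon$.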
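To conclude the proof, note that if Assumption 5 holds, from [4] (Lemma 6) and taking similar steps as in [11] (Lemma 8), it is implied that for any given $\epsilon > 0$, with probability one as $n \to \infty$:
$$\mathbb{E}_{\tilde{P}}\left[ |\omega^*(X, Y, Z) - \omega_{\hat{\theta}}(X, Y, Z)| \,\middle|\, \hat{\theta} \right] \le \eta, \qquad \mathbb{E}_{\tilde{Q}}\left[ |\omega^*(X, Y, Z) - \omega_{\hat{\theta}}(X, Y, Z)| \,\middle|\, \hat{\theta} \right] \le \eta, \tag{A48}$$
where $\eta := (1 - \tau)\, p_{\max} \sqrt{2 \lambda \epsilon / p_{\min}}$, with $\lambda$ being the Lebesgue measure corresponding to $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$. Note that the expectations in (A48) are random variables due to $\hat{\theta}$. Let us define $I_{DV}^{n}(X; Y | Z)$ as:
$$I_{DV}^{n}(X; Y | Z) := \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right].$$
Thus by the triangle inequality we have:
$$|\hat{I}_{DV}^{n}(X; Y | Z) - I(X; Y | Z)| \le |\hat{I}_{DV}^{n}(X; Y | Z) - I_{DV}^{n}(X; Y | Z)| + |I_{DV}^{n}(X; Y | Z) - I(X; Y | Z)|, \tag{A50}$$
where $\hat{I}_{DV}^{n}(X; Y | Z)$ was defined in (21).
To bound the first term, note that by the triangle inequality
$$|\hat{I}_{DV}^{n}(X; Y | Z) - I_{DV}^{n}(X; Y | Z)| \le \Delta_{DV} + \Delta'_{DV},$$
where
$$\Delta_{DV} := \left| |\mathcal{B}_{\mathrm{joint}}^{\alpha}|^{-1} \sum_{(X, Y, Z) \in \mathcal{B}_{\mathrm{joint}}^{\alpha}} \log \hat{\Gamma}(X, Y, Z) - \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] \right|$$
and
$$\Delta'_{DV} := \left| \log\left( |\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|^{-1} \sum_{(X, Y, Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} \hat{\Gamma}(X, Y, Z) \right) - \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] \right|.$$
Since $\hat{\Gamma}(\cdot)$ is bounded as:
$$\frac{\tau}{1 - \tau} \le \hat{\Gamma}(X, Y, Z) \le \frac{1 - \tau}{\tau},$$
by the Lipschitz continuity of $\log(\cdot)$ it follows that:
$$|\hat{I}_{DV}^{n}(X; Y | Z) - I_{DV}^{n}(X; Y | Z)| \le \Delta_{DV} + \Delta''_{DV},$$
where
$$\Delta''_{DV} := \frac{1 - \tau}{\tau} \left| |\mathcal{B}_{\mathrm{prod}}^{\alpha,s}|^{-1} \sum_{(X, Y, Z) \in \mathcal{B}_{\mathrm{prod}}^{\alpha,s}} \hat{\Gamma}(X, Y, Z) - \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] \right|.$$
Both $\Delta_{DV}$ and $\Delta''_{DV}$ converge strongly to zero from Propositions 1 and 2, respectively, i.e., for any given $\epsilon > 0$, we have that:
$$P\left( \lim_{n \to \infty} \Delta_{DV} > \epsilon / 4 \right) = 0, \qquad P\left( \lim_{n \to \infty} \Delta''_{DV} > \epsilon / 4 \right) = 0.$$
To bound the second term in (A50), using the triangle inequality it follows that:
$$|I_{DV}^{n}(X; Y | Z) - I(X; Y | Z)| \le \left| \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X, Y, Z) - \log \Gamma(X, Y, Z) \,\middle|\, \hat{\theta} \right] \right| + \left| \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \Gamma(X, Y, Z) \right] \right|.$$
Thus from (A48) and the Lipschitz continuity of $\Gamma$, $\hat{\Gamma}$, and $\log(\cdot)$, it follows that:
$$\begin{aligned} P\left( \lim_{n \to \infty} \left| \mathbb{E}_{\tilde{P}}\left[ \log \hat{\Gamma}(X, Y, Z) - \log \Gamma(X, Y, Z) \,\middle|\, \hat{\theta} \right] \right| > \frac{\eta}{\tau (1 - \tau)} \right) &= 0, \\ P\left( \lim_{n \to \infty} \left| \log \mathbb{E}_{\tilde{Q}}\left[ \hat{\Gamma}(X, Y, Z) \,\middle|\, \hat{\theta} \right] - \log \mathbb{E}_{\tilde{Q}}\left[ \Gamma(X, Y, Z) \right] \right| > \frac{\eta}{\tau^2} \right) &= 0. \end{aligned}$$
Then combining (A50) and (A52)–(A55), it is concluded that with probability one as $n \to \infty$
$$\begin{aligned} |\hat{I}_{DV}^{n}(X; Y | Z) - I(X; Y | Z)| &\le \Delta_{DV} + \Delta''_{DV} + \frac{\eta}{\tau (1 - \tau)} + \frac{\eta}{\tau^2} \\ &\le \frac{\epsilon}{4} + \frac{\epsilon}{4} + \frac{\epsilon}{2} = \epsilon, \end{aligned}$$
where the last step holds by choosing $\eta = \tau^2 (1 - \tau) \frac{\epsilon}{2}$, and $\epsilon$ and $\epsilon_0$ accordingly. In other words,
$$\hat{I}_{DV}^{n}(X; Y | Z) \xrightarrow{\text{a.s.}} I(X; Y | Z),$$
and the proof of Theorem 1 is completed. ☐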
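The empirical quantities appearing in this proof (the average of $\log \hat{\Gamma}$ over the joint batch and the logarithm of the average of $\hat{\Gamma}$ over the product batch) can be sketched as follows. This is an illustration under stated assumptions: the estimator itself is defined in (21), outside this appendix, and the relation $\hat{\Gamma} = \omega_{\hat{\theta}} / (1 - \omega_{\hat{\theta}})$ with classifier outputs clipped to $[\tau, 1 - \tau]$ is assumed here because it matches the bounds on $\hat{\Gamma}$ used above. The function name `dv_estimate` is hypothetical.

```python
import math

def dv_estimate(omega_joint, omega_prod, tau=1e-3):
    """Donsker-Varadhan style estimate from classifier outputs on two batches.

    omega_joint: classifier outputs on samples of the joint batch.
    omega_prod:  classifier outputs on samples of the product batch.
    Assumption of this sketch: Gamma_hat = omega / (1 - omega) after clipping
    omega to [tau, 1 - tau], so tau/(1-tau) <= Gamma_hat <= (1-tau)/tau.
    """
    def ratio(w):
        w = min(max(w, tau), 1.0 - tau)
        return w / (1.0 - w)

    mean_log = sum(math.log(ratio(w)) for w in omega_joint) / len(omega_joint)
    mean_ratio = sum(ratio(w) for w in omega_prod) / len(omega_prod)
    return mean_log - math.log(mean_ratio)

# An uninformative classifier (output 0.5 everywhere) yields an estimate of zero.
print(dv_estimate([0.5, 0.5, 0.5], [0.5, 0.5]))
```

Clipping keeps both the logarithm and the ratio bounded, which is the property the proof relies on when invoking the Lipschitz continuity of $\log(\cdot)$.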

References

  1. Belghazi, M.I.; Baratin, A.; Rajeshwar, S.; Ozair, S.; Bengio, Y.; Courville, A.; Hjelm, D. MINE: Mutual Information Neural Estimation. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 531–540.
  2. Wang, Q.; Kulkarni, S.R.; Verdú, S. Universal estimation of information measures for analog sources. Found. Trends Commun. Inf. Theory 2009, 5, 265–353.
  3. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
  4. Mukherjee, S.; Asnani, H.; Kannan, S. CCMI: Classifier based Conditional Mutual Information Estimation. In Proceedings of the Uncertainty in Artificial Intelligence, Tel Aviv, Israel, 22–25 July 2019.
  5. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 22–24 September 1999; pp. 368–377.
  6. Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
  7. Donsker, M.D.; Varadhan, S.R.S. Asymptotic evaluation of certain markov process expectations for large time, I. Comm. Pure Appl. Math. 1975, 28, 1–47.
  8. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861.
  9. Poole, B.; Ozair, S.; van den Oord, A.; Alemi, A.A.; Tucker, G. On variational lower bounds of mutual information. In Proceedings of the NeurIPS Workshop on Bayesian Deep Learning, Montréal, QC, Canada, 7–8 December 2018.
  10. Molavipour, S.; Bassi, G.; Skoglund, M. Conditional Mutual Information Neural Estimator. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 5025–5029.
  11. Molavipour, S.; Bassi, G.; Skoglund, M. Neural Estimators for Conditional Mutual Information Using Nearest Neighbors Sampling. IEEE Trans. Signal Process. 2021, 69, 766–780.
  12. Marko, H. The bidirectional communication theory-a generalization of information theory. IEEE Trans. Commun. 1973, 21, 1345–1351.
  13. Massey, J. Causality, Feedback and Directed Information. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Honolulu, HI, USA, 27–30 November 1990; pp. 303–305.
  14. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461.
  15. Kramer, G. Directed Information for Channels with Feedback. Ph.D. Thesis, Department of Information Technology and Electrical Engineering, ETH Zurich, Zürich, Switzerland, 1998.
  16. Permuter, H.H.; Kim, Y.H.; Weissman, T. Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 2011, 57, 3248–3259.
  17. Venkataramanan, R.; Pradhan, S.S. Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source. IEEE Trans. Inf. Theory 2007, 53, 2154–2179.
  18. Tanaka, T.; Skoglund, M.; Sandberg, H.; Johansson, K.H. Directed information and privacy loss in cloud-based control. In Proceedings of the American Control Conference (ACC), Seattle, WA, USA, 24–26 May 2017; pp. 1666–1672.
  19. Rissanen, J.; Wax, M. Measures of mutual and causal dependence between two time series (Corresp.). IEEE Trans. Inf. Theory 1987, 33, 598–601.
  20. Quinn, C.J.; Coleman, T.P.; Kiyavash, N.; Hatsopoulos, N.G. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 2011, 30, 17–44.
  21. Cai, Z.; Neveu, C.L.; Baxter, D.A.; Byrne, J.H.; Aazhang, B. Inferring neuronal network functional connectivity with directed information. J. Neurophysiol. 2017, 118, 1055–1069.
  22. Ver Steeg, G.; Galstyan, A. Information transfer in social media. In Proceedings of the 21st International Conference on World Wide Web, Lyon, France, 16–20 April 2012; pp. 509–518.
  23. Quinn, C.J.; Kiyavash, N.; Coleman, T.P. Directed information graphs. IEEE Trans. Inf. Theory 2015, 61, 6887–6909.
  24. Vicente, R.; Wibral, M.; Lindner, M.; Pipa, G. Transfer entropy—A model-free measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 2011, 30, 45–67. [Google Scholar] [CrossRef] [PubMed][Green Version]
  25. Chávez, M.; Martinerie, J.; Le Van Quyen, M. Statistical assessment of nonlinear causality: Application to epileptic EEG signals. J. Neurosci. Meth. 2003, 124, 113–128. [Google Scholar] [CrossRef]
  26. Spinney, R.E.; Lizier, J.T.; Prokopenko, M. Transfer entropy in physical systems and the arrow of time. Phys. Rev. E 2016, 94, 022135. [Google Scholar] [CrossRef][Green Version]
  27. Runge, J. Quantifying information transfer and mediation along causal pathways in complex systems. Phys. Rev. E 2015, 92, 062829. [Google Scholar] [CrossRef][Green Version]
  28. Murin, Y. k-NN Estimation of Directed Information. arXiv 2017, arXiv:1711.08516. [Google Scholar]
  29. Faes, L.; Kugiumtzis, D.; Nollo, G.; Jurysta, F.; Marinazzo, D. Estimating the decomposition of predictive information in multivariate systems. Phys. Rev. E 2015, 91, 032904. [Google Scholar] [CrossRef][Green Version]
  30. Baboukani, P.S.; Graversen, C.; Alickovic, E.; Østergaard, J. Estimating Conditional Transfer Entropy in Time Series Using Mutual Information and Nonlinear Prediction. Entropy 2020, 22, 1124. [Google Scholar] [CrossRef]
  31. Zhang, J.; Simeone, O.; Cvetkovic, Z.; Abela, E.; Richardson, M. ITENE: Intrinsic Transfer Entropy Neural Estimator. arXiv 2019, arXiv:1912.07277. [Google Scholar]
  32. Aharoni, Z.; Tsur, D.; Goldfeld, Z.; Permuter, H.H. Capacity of Continuous Channels with Memory via Directed Information Neural Estimator. arXiv 2020, arXiv:2003.04179. [Google Scholar]
  33. Schäfer, A.M.; Zimmermann, H.G. Recurrent neural networks are universal approximators. Int. J. Neural Syst. 2007, 17, 253–263. [Google Scholar] [CrossRef] [PubMed]
  34. Breiman, L. The individual ergodic theorem of information theory. Ann. Math. Stat. 1957, 28, 809–811. [Google Scholar] [CrossRef]
  35. Kontoyiannis, I.; Skoularidou, M. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory 2016, 62, 6053–6067. [Google Scholar] [CrossRef][Green Version]
  36. Molavipour, S.; Bassi, G.; Skoglund, M. Testing for directed information graphs. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 3–6 October 2017; pp. 212–219. [Google Scholar]
  37. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  38. Devroye, L.; Gyorfi, L.; Krzyzak, A.; Lugosi, G. On the strong universal consistency of nearest neighbor regression function estimates. Ann. Stat. 1994, 22, 1371–1385. [Google Scholar] [CrossRef]
  39. Collomb, G. Nonparametric time series analysis and prediction: Uniform almost sure convergence of the window and k-NN autoregression estimates. Statistics 1985, 16, 297–307. [Google Scholar] [CrossRef]
  40. Yakowitz, S. Nearest-neighbour methods for time series analysis. J. Time Ser. Anal. 1987, 8, 235–247. [Google Scholar] [CrossRef]
  41. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer Science & Business Media: Dordrecht, The Netherlands, 2012. [Google Scholar]
  42. Raleigh, G.G.; Cioffi, J.M. Spatio-temporal coding for wireless communication. IEEE Trans. Commun. 1998, 46, 357–366.
  43. Granger, C.W.J. Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 1969, 37, 424–438.
  44. Kamarianakis, Y.; Prastacos, P. Space–time modeling of traffic flow. Comput. Geosci. 2005, 31, 119–133.
  45. Molavipour, S.; Bassi, G.; Čičić, M.; Skoglund, M.; Johansson, K.H. Causality Graph of Vehicular Traffic Flow. arXiv 2020, arXiv:2011.11323.
  46. Ross, S.M.; Peköz, E.A. A Second Course in Probability. 2007. Available online: www.bookdepository.com/publishers/Pekozbooks (accessed on 20 May 2021).
  47. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
  48. Györfi, L.; Härdle, W.; Sarda, P.; Vieu, P. Nonparametric Curve Estimation from Time Series; Springer: Berlin/Heidelberg, Germany, 2013; Volume 60.
  49. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, UK, 2014.
Figure 1. The memory considered for conditional mutual information terms in directed information (left) and transfer entropy (right) at time instance $i$. To compute directed information (left), the effect of $X^i$ (i.e., $X_i$ and all its past samples) on $Y_i$ is considered, while the history of $Y_i$ is excluded. However, for transfer entropy (right), the effect of $X_{i-J}^{i-1}$ (i.e., the previous $J$ samples before $X_i$) on $Y_i$ is accounted for, while we exclude the history of $Y_i$. Note that the lengths of the memories ($J$ and $L$) for transfer entropy may differ.
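The two memory conventions described in the Figure 1 caption can be illustrated with a small slicing helper. This is a minimal sketch that assumes the standard forms of the per-time terms, $I(X^i; Y_i \mid Y^{i-1})$ for directed information and $I(X_{i-J}^{i-1}; Y_i \mid Y_{i-L}^{i-1})$ for transfer entropy. The function names and the 0-based indexing are illustrative choices and are not taken from the paper.

```python
def directed_information_windows(x, y, i):
    """Variables entering the directed information term at 0-based time i.

    Returns (X^i, Y_i, Y^{i-1}): x up to and including i, the current y,
    and the full past of y, which is the conditioning set.
    """
    return x[: i + 1], y[i], y[:i]


def transfer_entropy_windows(x, y, i, J, L):
    """Variables entering the transfer entropy term at 0-based time i.

    Returns (X_{i-J}^{i-1}, Y_i, Y_{i-L}^{i-1}): the J samples of x before
    time i, the current y, and the L samples of y before time i.
    """
    if i < max(J, L):
        raise ValueError("time index i must be at least max(J, L)")
    return x[i - J : i], y[i], y[i - L : i]


# Toy sequences, used only to show which samples are selected.
x = list(range(10))
y = [100 + t for t in range(10)]
print(directed_information_windows(x, y, 5))
print(transfer_entropy_windows(x, y, 5, J=2, L=3))
```

The only difference between the two helpers is the window length: directed information keeps the whole past and the current sample of x, whereas transfer entropy truncates both histories to fixed lengths that need not be equal.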
Figure 2. Construction of the product batch from the data set, which is shown in the left table. Suppose $w_i = 1$ and the $z$ components of the rows marked with '*' (indexed by $j_1$ and $j_2$) are the $k$ nearest neighbors of $z_i$ for $k = 2$. Then the triples $(x_{j_1}, y_i, z_i)$ and $(x_{j_2}, y_i, z_i)$ are packed into the product batch, as shown in the right table.
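The construction in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function name `product_batch` is hypothetical, `w` is treated simply as a 0/1 indicator of which rows act as anchors, neighbors are found by brute-force Euclidean distance, and a row is assumed not to count as its own neighbor.

```python
import numpy as np


def product_batch(x, y, z, w, k=2):
    """For each anchor row i with w[i] == 1, pair (y_i, z_i) with x_j,
    where j ranges over the k nearest neighbors of z_i among the z values."""
    # Treat every variable as a 2-D array of shape (n_samples, dim).
    x, y, z = (np.asarray(a, dtype=float).reshape(len(a), -1) for a in (x, y, z))
    rows = []
    for i in np.flatnonzero(np.asarray(w)):
        dist = np.linalg.norm(z - z[i], axis=1)
        dist[i] = np.inf  # assumption: exclude the anchor row itself
        neighbors = np.argsort(dist, kind="stable")[:k]
        for j in neighbors:
            rows.append(np.concatenate([x[j], y[i], z[i]]))
    return np.array(rows)


# Toy data: only row 0 is an anchor; its two nearest z values are rows 1 and 2.
x = [10, 11, 12, 13, 14]
y = [20, 21, 22, 23, 24]
z = [0.0, 0.1, 0.2, 5.0, 5.1]
w = [1, 0, 0, 0, 0]
batch = product_batch(x, y, z, w, k=2)
print(batch)
```

Each output row keeps $y_i$ and $z_i$ from the anchor and takes $x$ from a neighboring row, which is what makes the batch resemble samples where $x$ and $y$ are coupled only through $z$.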
Figure 3. Estimated CMI for AR-1 model in (30) using $n = 2 \times 10^4$ samples with $d = 1$. The shaded region shows the range of the estimated values over the Monte Carlo trials.
Figure 4. Estimated CMI for AR-1 model in (30) using $n = 2 \times 10^4$ samples with $d = 10$. The shaded regions show the range of the estimated values over the Monte Carlo trials. The blue shade corresponds to estimation with our method, the yellow shade to estimation with the MI-diff approach, and the green shade marks where the two regions overlap.
Figure 5. Causal relationship of the processes.
Figure 6. Graphical representation of the causal influences between the processes using pairwise directed information (a), and causally conditioned directed information (b).
Table 1. Hyper-parameters.

| Hyper-parameter | Value |
| --- | --- |
| Hidden units | 64 |
| Hidden layers | 2 (64 × 64) |
| Activation | ReLU |
| $\tau$ | $10^{-3}$ |
| Optimizer | Adam |
| Learning rate | $10^{-3}$ |
| Epochs | 200 |
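For readers who want to reproduce the architecture in Table 1, the settings can be collected in a small configuration object. The sketch below is hedged: it takes $\tau$ and the learning rate as $10^{-3}$, assumes a single linear output unit and He-style random initialization, and shows only an untrained forward pass in NumPy. The names `HyperParams`, `init_mlp`, and `forward` are hypothetical, and the role of $\tau$ and the training loop are left to the main text.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class HyperParams:
    hidden_units: int = 64
    hidden_layers: int = 2
    tau: float = 1e-3            # assumed value; its role is defined in the main text
    optimizer: str = "Adam"      # recorded only; no optimizer is run in this sketch
    learning_rate: float = 1e-3  # assumed value
    epochs: int = 200


def init_mlp(in_dim, hp, seed=0):
    """Random weights for a ReLU network with hp.hidden_layers hidden layers."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim] + [hp.hidden_units] * hp.hidden_layers + [1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, inputs):
    """ReLU on the hidden layers, linear output (an assumption of this sketch)."""
    h = np.asarray(inputs, dtype=float)
    for weights, bias in params[:-1]:
        h = np.maximum(h @ weights + bias, 0.0)
    weights, bias = params[-1]
    return h @ weights + bias


hp = HyperParams()
params = init_mlp(in_dim=3, hp=hp)
out = forward(params, np.ones((5, 3)))
print(out.shape)
```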
Table 2. True and estimated DIR.

| | True DIR | Estimation with Our Method (Mean ± Std) |
| --- | --- | --- |
| $I(X \to Y)$ | 0.59 | 0.57 ± 0.00 |
| $I(X \to Z)$ | 0.57 | 0.55 ± 0.00 |
| $I(Y \to Z)$ | 1.99 | 1.92 ± 0.01 |
| $I(Y \to X)$ | 0 | 0.00 ± 0.00 |
| $I(Z \to X)$ | 0 | 0.00 ± 0.00 |
| $I(Z \to Y)$ | 0 | 0.00 ± 0.00 |
Table 3. True and estimated DIR.

| | True DIR | Estimation with Our Method (Mean ± Std) |
| --- | --- | --- |
| $I(X \to Y \,\|\, Z)$ | 0.59 | 0.57 ± 0.00 |
| $I(X \to Z \,\|\, Y)$ | 0 | 0.00 ± 0.00 |
| $I(Y \to Z \,\|\, X)$ | 1.42 | 1.52 ± 0.01 |
| $I(Y \to X \,\|\, Z)$ | 0 | 0.01 ± 0.00 |
| $I(Z \to X \,\|\, Y)$ | 0 | 0.01 ± 0.00 |
| $I(Z \to Y \,\|\, X)$ | 0 | 0.01 ± 0.00 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

MDPI and ACS Style

Molavipour, S.; Ghourchian, H.; Bassi, G.; Skoglund, M. Neural Estimator of Information for Time-Series Data with Dependency. Entropy 2021, 23, 641. https://doi.org/10.3390/e23060641
