Towards a Framework for Observational Causality from Time Series: When Shannon Meets Turing

We propose a tensor based approach to infer causal structures from time series. An information theoretical analysis of transfer entropy (TE) shows that TE results from transmission of information over a set of communication channels. Tensors are the mathematical equivalents of these multichannel causal channels. The total effect of subsequent transmissions, i.e., the total effect of a cascade, can now be expressed in terms of the tensors of these subsequent transmissions using tensor multiplication. With this formalism, differences in the underlying structures can be detected that are otherwise undetectable using TE or mutual information. Additionally, using a system comprising three variables, we prove that bivariate analysis suffices to infer the structure, that is, bivariate analysis suffices to differentiate between direct and indirect associations. Some results translate to TE. For example, a Data Processing Inequality (DPI) is proven to exist for transfer entropy.


I. INTRODUCTION
Exact knowledge about the functional relationships that fully determine the behavior of complex systems is a holy grail in the (applied) sciences and engineering. Several methods have been developed to arrive at causal or associational descriptions. The main difference between a causal and an associational description is that a causal description requires experimentation, whereas for an associational description (statistical) data suffice [1]. Because interventions are not always possible, we often have to make do with observational data. A plethora of methods to infer causal structures from such data has been developed, see for example [2][3][4][5][6]. None of these methods currently seems capable of both differentiating between direct and indirect associations (i.e., association via one or more mediators) and determining the directionality within its own formalism.
In this paper a novel approach inspired by Turing machines [7] is proposed. If causal relations can be computed given the data, a Turing machine exists that "computes" causality, i.e., the causal relation is encoded in the transition function. Transfer Entropy [4] is a measure that can capture causal relations in as far as they are encoded in the probability density functions [8]. Instead of inferring the transition functions of the related Turing machines, we derive a tensor formalism utilizing concepts from Information Theory [9]. This formalism: (1) is able to determine the directionality of relations within a complex network, (2) can differentiate between direct and indirect associations, and (3) enables simulating the behavior of the network using the inferred relations. We furthermore show that noise is needed for proper causal inference using our framework.

A. Outline
We start this paper with an introduction of aspects from Information Theory that are needed to derive our framework. Next, (bivariate) Transfer Entropy (TE) is introduced. Transfer Entropy is capable of detecting directionality and cycles. Using concepts from Information Theory it is shown that TE allows for a tensor based formalism which gives rise to a specific set of calculation rules. We then show that this framework lets us differentiate between direct and indirect relationships. It is also used to derive the conditions under which this is not possible. We end this paper with experiments to illustrate that we are indeed capable of detecting nonlinear relationships.

II. PRELIMINARIES
Statistical independence is foundational to causal inference [5], and therefore also to this paper. We will give a short overview of the two most related and relevant assumptions: (1) the faithfulness assumption, and (2) the Causal Markov Condition. A directed graph is said to be faithful to the underlying probability distributions if the independence relations that follow from the graph are exactly the independence relations that follow from the underlying probability distributions. E.g., the faithfulness assumption for the chain X → Y → Z implies that X and Z are independent given Y: p(X, Z|Y) = p(X|Y)p(Z|Y). This is denoted as X ⊥⊥ Z|Y.
The Causal Markov Condition states that a process is independent of its non-effects, given its direct causes, i.e., its parents. This is relevant in the context of time series. We illustrate the Causal Markov Condition with an example that will be used later in this paper.
Example 1. Let i and g be the parents of j, let g also be the parent of h, and let i and j be non-effects of h. According to the Causal Markov Condition j and h are independent given g and i, i.e., p(j|g, h, i) = p(j|g, i).
selects symbols from the alphabet $\mathcal{Y}$, and the random variable Z selects symbols from the alphabet $\mathcal{Z}$. Once encoded, the message is transmitted symbol by symbol: the input symbol is transformed into an output symbol. The output alphabet can have a different cardinality than the input alphabet. The transformation from input to output symbol is modeled as a Markov chain. The probability that a specific input symbol is sent and a specific output symbol is received only depends on the alphabet symbol that was sent. This implies that the communication process transforms the input probability mass function (pmf) into the output pmf. With $x \in \mathcal{X}$ a realization of X and $y \in \mathcal{Y}$ a realization of Y we have $p(y) := \Pr\{Y = y\}$ and $p(x) := \Pr\{X = x\}$ respectively.
The transmitted message is decoded and made available to the receiver. In this paper we assume that no decoding takes place.

A. Mutual Information
If there is association between two messages, information is said to be shared between them. The measure of this information is the mutual information (MI):
$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}. \tag{1}$$
It is nonnegative and symmetric in X and Y. It represents the reduction in uncertainty about the random variable X given that we have knowledge about the random variable Y.
It is intuitively clear that, given the information content of source data, the information can never increase in subsequent transmission steps. This is formalized in the Data Processing Inequality or DPI: processing of data can never increase the amount of information [10]. For the cascade X → Y → Z the DPI implies that, in terms of MI,
$$I(X;Z) \le \min\left[I(X;Y),\, I(Y;Z)\right]. \tag{2}$$
The maximum rate with which information can be transmitted between the sender and receiver is the channel capacity $C_{XY} = \max_{p(x)} [I(X;Y)]$. This is achieved for a so-called capacity-achieving input distribution.
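The DPI for a cascade can be checked numerically. The following is a minimal sketch, not the authors' code: the helper names `output_pmf`, `mutual_information` and `cascade`, the input pmf and both transition matrices are assumptions chosen for illustration.

```python
from math import log2

def output_pmf(p_in, channel):
    """Push an input pmf through a row-stochastic matrix channel[i][j] = p(out_j | in_i)."""
    return [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
            for j in range(len(channel[0]))]

def mutual_information(p_in, channel):
    """Mutual information in bits between the channel input and the channel output."""
    p_out = output_pmf(p_in, channel)
    return sum(p_in[i] * channel[i][j] * log2(channel[i][j] / p_out[j])
               for i in range(len(p_in))
               for j in range(len(channel[0]))
               if p_in[i] > 0 and channel[i][j] > 0)

def cascade(first, second):
    """Transition matrix of `first` followed by `second` (ordinary matrix product)."""
    return [[sum(first[i][j] * second[j][k] for j in range(len(second)))
             for k in range(len(second[0]))]
            for i in range(len(first))]

# Hypothetical example: input pmf for X and channels X -> Y and Y -> Z.
p_x = [0.3, 0.7]
xy = [[0.9, 0.1], [0.2, 0.8]]
yz = [[0.7, 0.3], [0.1, 0.9]]

i_xy = mutual_information(p_x, xy)
i_yz = mutual_information(output_pmf(p_x, xy), yz)
i_xz = mutual_information(p_x, cascade(xy, yz))
print(i_xy, i_yz, i_xz)
```

For these channels the value for X → Z should not exceed either of the two other values, in line with the inequality for the cascade.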

B. The communication channel
In Information Theory the directed graph representing a Markov chain is represented as a communication channel, or channel in short. The channel has an input side (left hand side) and an output side (right hand side). On the left hand side we place all the vertices of the Markov chain with outgoing edges and on the right hand side we place all the vertices of the Markov chain with incoming edges. The input vertices are connected to the output vertices via undirected edges. In a channel every input alphabet symbol has its own input vertex. Likewise, every output alphabet symbol has its own output vertex.
The simplest type of channel is the noisy discrete memoryless communication channel (DMC). In a memoryless channel the output ($y_t$) only depends on the input ($x_t$) and not on the past inputs or outputs: $p(y_t|x_t, x_{t-1}, y_{t-1}) = p(y_t|x_t)$. A memoryless channel embodies the Markov property. In a noisy channel the output depends on the input and another random variable representing noise. The effect of transmitting data using a DMC is a consequence of the Law of Total Probability [11] because
$$\Pr\{Y = \psi_j\} = \sum_i \Pr\{X = \chi_i\}\, \Pr\{Y = \psi_j | X = \chi_i\}, \tag{3}$$
with $\Pr\{Y = \psi_j\}$ the j-th element of p(y), and $\Pr\{X = \chi_i\}$ the i-th element of p(x). The transmission of data over a DMC transforms the probability mass function of the input into the pmf of the output via a linear transformation. The probability transition matrix $\Pr\{Y = \psi_j | X = \chi_i\}$ fully characterizes the DMC [10].
Assuming a fixed (e.g., lexicographic) order of the alphabet elements, we can introduce an index notation for the pmfs, e.g., $p^j := \Pr\{Y = \psi_j\}$ and $p^i := \Pr\{X = \chi_i\}$. In this paper every index is associated with a specific random variable. In table I an overview is given.

C. Tensor representation of the communication channel
One of the many virtues of Information Theory is that it enables the use of linear algebra. Because we do not want to get overwhelmed by increasingly complex probabilistic equations, we use index notation and the Einstein summation convention. This summation convention simplifies equations by implying summation over indices that appear once as an upper, or contravariant, index and once as a lower, or covariant, index. Using these we rewrite Eq. (3) as
$$p^j = p^i A^j_i. \tag{4}$$
The covariant indices indicate the variables that we condition on. The elements of the row stochastic probability transition matrix are the elements of the probability transition tensor A [12]. Using the standard notation instead of the Einstein summation convention, MI can be rewritten as
$$I(X;Y) = \sum_{i,j} p^i A^j_i \log_2 \frac{A^j_i}{p^j}. \tag{5}$$
Mutual information depends on the elements of the tensor and on the input pmf. This is problematic when MI or MI derived measures are used to infer the underlying structure, if we assume that the structure is independent of the input. We can illustrate this by assuming that the probability transition tensor equals the Kronecker delta.
Example 2. Assume that $A^j_i = \delta^j_i$, i.e., the symbol received equals the symbol sent. In this case Eq. (5) reduces to $I(X;Y) = \sum_i p^i \log_2 \frac{1}{p^i}$. Now set the probability of one of the alphabet elements to 1 − ε. This implies that all other symbol probabilities are equal to or smaller than ε. Taking the limit ε → 0 results in a mutual information that goes to 0. In other words, although there might be a perfect, i.e., noiseless, channel that represents the association between the random variables X and Y, MI could be arbitrarily small.
This leads us to the following proposition for inferring structures using MI based measures:
Proposition 1. In case MI or MI related measures are used to infer the structure of a system, the probability transition tensors, or measures based on the elements of the probability transition tensors, should be used.
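The effect described in Example 2 can be made concrete with a few lines of code. This is an illustrative sketch only: the function names `identity_channel_mi` and `skewed_pmf`, the alphabet size of 4 and the choice to spread ε evenly over the remaining symbols are assumptions, not part of the paper.

```python
from math import log2

def identity_channel_mi(p_in):
    """MI in bits when the transition tensor is the Kronecker delta.

    The output pmf then equals the input pmf and the MI sum
    collapses to sum_i p_i * log2(1 / p_i).
    """
    return sum(p * log2(1.0 / p) for p in p_in if p > 0)

def skewed_pmf(eps, n_symbols):
    """One symbol has probability 1 - eps; the others share eps equally (assumption)."""
    return [1.0 - eps] + [eps / (n_symbols - 1)] * (n_symbols - 1)

# The channel stays noiseless while the MI shrinks with eps.
for eps in (0.5, 0.1, 0.01, 0.0001):
    print(eps, identity_channel_mi(skewed_pmf(eps, 4)))
```

The printed MI values shrink as ε decreases, although the transition tensor is the same noiseless tensor in every case.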
The earlier mentioned channel capacity is such a measure. It only depends on the elements of the probability transition tensor [13], e.g., $C_{XY} := \Gamma(A)$. In our example of perfect transmission with an arbitrarily small MI, the channel capacity only depends on the number of alphabet elements: $C_{XY} = \min[\log_2(|\mathcal{X}|), \log_2(|\mathcal{Y}|)]$ [10]. This gives rise to the following definition.
Definition 1 (Normalized channel capacity). The normalized channel capacity is defined as
$$\frac{C_{XY}}{\min[\log_2(|\mathcal{X}|), \log_2(|\mathcal{Y}|)]}.$$
Because the channel capacity is the maximal achievable mutual information for a specific channel, the earlier mentioned DPI is also applicable to the channel capacity.
Corollary 1 (DPI for channel capacity). For the chain X → Y → Z the DPI immediately implies that
$$\Gamma(C) \le \min[\Gamma(A), \Gamma(B)], \tag{6}$$
with A representing the tensor of the transmission X → Y, B : Y → Z, and C : X → Z.
The proof is straightforward and therefore omitted.
In this short and incomplete introduction to Information Theory, no assumptions (other than stationarity, ergodicity and the Markov property) were made about the underlying mechanisms leading to the association between random variables. In its formulation it can therefore be applied to all cases where observational data are available.

IV. TRANSFER ENTROPY
Schreiber introduced Transfer Entropy in 2000 [4]. Like MI it is non-parametric, but unlike MI it is an essentially asymmetric measure and as such it enables the differentiation between a source and a destination. It is an information theoretical implementation of Wiener's principle of Causality [14]: a cause combined with the past of the effect predicts the effect better than the past of the effect does on its own. In contrast to Granger causality [2], Transfer Entropy is capable of capturing nonlinear relationships.
In this paper we use a slightly modified version which was shown by Wibral et al. to fully comply with Wiener's principle of Causality. It was proved that this modified TE is maximal for the real interaction delay [15]. We assume that Y is a Markov process of order ℓ ≥ 1. This implies that the future $y_t$ also depends on its past $y^- = (y_{t-1}, \cdots, y_{t-\ell})$. The destination also depends on the source data X. With τ the finite interaction delay and $x^- \in \mathcal{X}^m$ the delayed past of the source, Transfer Entropy is defined as
$$TE_{X \to Y} = \sum_{y,\, x^-,\, y^-} p(y, x^-, y^-) \log_2 \frac{p(y | x^-, y^-)}{p(y | y^-)}. \tag{7}$$
To be able to differentiate a cause from an effect, two hypotheses have to be assessed: (1) X is the cause and Y is the effect, and (2) Y is the cause and X is the effect. Per hypothesis the interaction delay that maximizes the respective TE is determined. If the resulting TE equals 0, it is assumed that there is no relation. Assuming that the TE values are larger than 0, there are in practice two possibilities:
(1) The optimal interaction delays are equal: we assume that the hypothesis with the largest TE is valid.
(2) The optimal interaction delays are different: both hypotheses are valid, so we have detected a cycle.
Without loss of generality we assume in this paper that there are no cycles and that the interaction delays are all equal to 0.
Transfer Entropy is a conditional mutual information [4]. It is therefore likely that it can be associated with communication channels. We start with conditioning the MI from Eq. (1) on the event $y^- = \psi^-_g$, resulting in
$$I(X;Y|\psi^-_g) = \sum_{x^-,\, y} p(x^-, y | \psi^-_g) \log_2 \frac{p(x^-, y | \psi^-_g)}{p(x^- | \psi^-_g)\, p(y | \psi^-_g)}. \tag{8}$$
Because $x^-$ and $y^-$ are the only parents of the output y, it follows from the Causal Markov Condition that the associated channel is memoryless. This sub-channel information quantifies the amount of information that is transmitted over the g-th sub-channel. Transfer Entropy of Eq. (7) can now be expressed as
$$TE_{X \to Y} = \sum_g p(\psi^-_g)\, I(X;Y|\psi^-_g). \tag{9}$$

A. The causal channel
Equation (9) gives rise to a very specific communication channel: a channel with the topology of an inverse multiplexer. An inverse multiplexer consists of a demultiplexer and a multiplexer in series. A demultiplexer separates an input data stream into multiple output data streams. We call these different streams sub-channels. A multiplexer combines (multiplexes) several input data streams into a single output data stream [16].
Definition 2 (Causal channel). A causal channel is an inverse multiplexer in which the demultiplexer selects the sub-channel over which the data are sent based on the past of the output data. Each sub-channel consists of a DMC. The input symbol is fed to a specific input vertex of the chosen DMC. The DMC transforms the input in a probabilistic fashion into an output symbol. The multiplexer combines the outputted symbols into the output message. See Figure 1a.
This definition forms the basis for the theorem that is central to this paper.
Theorem 1 (Transfer Entropy results from data transmission over a causal channel). Transfer Entropy is the average conditional mutual information of data transmission over a causal channel.
Proof. The relative frequency with which the g-th sub-channel is chosen equals $p(\psi^-_g)$. Each sub-channel is a DMC, so the mutual information of the g-th sub-channel equals $I(X;Y|\psi^-_g)$. The weighted average of the mutual information over all the sub-channels is equal to $\sum_g p(\psi^-_g)\, I(X;Y|\psi^-_g)$, which is the definition of TE in Eq. (9).
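The statement of Theorem 1 translates directly into an estimator: split the samples by the past of the destination, compute the MI per sub-channel, and average with the sub-channel frequencies as weights. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses plug-in (relative frequency) estimates, fixes ℓ = m = 1 with the source past taken as x[t−1], and the function name `transfer_entropy` and the synthetic test signal are hypothetical.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in TE estimate in bits, with source past x[t-1] and target past y[t-1]."""
    # Each sample is (past of target g, past of source i, current target j).
    triples = Counter(zip(target[:-1], source[:-1], target[1:]))
    n = sum(triples.values())
    past_counts = Counter()
    for (g, _, _), c in triples.items():
        past_counts[g] += c

    te = 0.0
    for g, n_g in past_counts.items():
        # Counts of (i, j) inside sub-channel g.
        joint = {(i, j): c for (gg, i, j), c in triples.items() if gg == g}
        n_i, n_j = Counter(), Counter()
        for (i, j), c in joint.items():
            n_i[i] += c
            n_j[j] += c
        mi_g = sum((c / n_g) * log2(c * n_g / (n_i[i] * n_j[j]))
                   for (i, j), c in joint.items())
        te += (n_g / n) * mi_g  # weight by how often sub-channel g is selected
    return te

# Synthetic check: y copies x with a delay of one sample.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(20000)]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
print(te_xy, te_yx)
```

For this signal the estimate in the direction X → Y should be close to 1 bit, while the estimate in the opposite direction should be close to 0, up to the small positive bias of plug-in estimators.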

B. Tensor representation of a causal channel
Because every sub-channel of the causal channel represents a DMC, a causal channel is represented by a probability transition tensor. We will call this tensor a causal tensor. For the relation X → Y we get the following equation for the g-th sub-channel:
$$p^j_g = p^{\hat{i}}_g A^j_{g\hat{i}}. \tag{10}$$
The elements of the tensor A are given by $A^j_{g\hat{i}} = p(\psi_j | \chi^-_{\hat{i}}, \psi^-_g)$. With $p^g := p(\psi^-_g)$, TE can now be rewritten as
$$TE_{X \to Y} = \sum_{g,\hat{i},j} p^g\, p^{\hat{i}}_g A^j_{g\hat{i}} \log_2 \frac{A^j_{g\hat{i}}}{p^j_g}. \tag{11}$$
In a similar fashion as for MI, it can be shown that TE can be made arbitrarily close to 0 while the causal tensor itself represents a noiseless transmission. It is therefore not an optimal measure to infer structures. Again we would prefer to use the tensors themselves, or measures based on these tensors like the channel capacity.
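A causal tensor can be estimated from two encoded time series by counting. The following sketch is an assumption-laden illustration rather than the authors' procedure: it uses ℓ = m = 1, takes x[t−1] as the source past, stores the tensor as nested dictionaries, and the name `causal_tensor` and the two short sequences are hypothetical.

```python
from collections import Counter, defaultdict

def causal_tensor(source, target):
    """Estimate A[g][i][j] = p(y_t = j | x_{t-1} = i, y_{t-1} = g) by relative frequencies."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for g, i, j in zip(target[:-1], source[:-1], target[1:]):
        counts[g][i][j] += 1
    tensor = {}
    for g, rows in counts.items():
        tensor[g] = {}
        for i, row in rows.items():
            total = sum(row.values())
            # Each (g, i) row is normalized, so every sub-channel is row stochastic.
            tensor[g][i] = {j: c / total for j, c in row.items()}
    return tensor

# Hypothetical toy data in which the target repeats the source one step later.
source = [0, 1, 1, 0, 1, 0, 0, 1]
target = [0, 0, 1, 1, 0, 1, 0, 0]
a = causal_tensor(source, target)
for g in sorted(a):
    for i in sorted(a[g]):
        print("g =", g, "i =", i, "row =", a[g][i])
```

Because the toy target is a delayed copy of the source, every estimated row puts all probability mass on a single output symbol, which is the noiseless case discussed in the text.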
The calculation of the channel capacity for a causal channel is not trivial.We assume however that it is possible to determine the channel capacity.
When causal tensors are used to infer the underlying structure, simulation becomes possible once the structure has been determined, because each edge in a directed causal graph has a corresponding causal tensor. As indicated in the introduction, the approach in this paper was inspired by a Turing machine. The causal tensor is a realization of the transition function of a Turing machine that encodes causality in as far as the causality is encoded in the pmfs. To warrant the use of the adjective "causal", however, we have to show that within the framework of causal tensors we are capable of differentiating between direct and indirect associations. That this seems possible can be intuited when considering the chain X → Y → Z (see Figure 2a). The relation X → Z is a resultant of the other relations, i.e., an indirect association. Within the framework of causal tensors we would expect that this resulting relation can be expressed in terms of the tensors of the other relations once the algebraic rules for manipulating the tensors are known.

V. CALCULATION RULES FOR CAUSAL TENSORS
Every sub-channel described by the causal tensor is a (row) stochastic tensor. Operations performed on these tensors should result in either scalars, stochastic vectors or stochastic tensors. The basic algebraic rules are well known because we can borrow them from linear algebra.
Without loss of generality we assume that a bivariate measurement of the relations within a system consisting of three random variables, results in the directed triangle from Figure 2d.The chain and the fork are other possible structures that lead to the directed triangle being measured.

A. The chain structure
First let the chain X → Y → Z be the ground truth. In addition to Eq. (10), $p^j_g = p^{\hat{i}}_g A^j_{g\hat{i}}$, there are two other causal tensors: B : Y → Z, and C : X → Z. Because it is a straightforward exercise we leave it to the reader to confirm that
$$p^k_h = p^{\hat{j}}_h B^k_{h\hat{j}}, \tag{12a}$$
$$p^k_h = p^{\hat{i}'}_h C^k_{h\hat{i}'}. \tag{12b}$$
The index $\hat{i}'$ in Eq. (12b) is the index related to the cause $x'^- \in \mathcal{X}^{m'}$ of Z. The index $\hat{i}$ in Eq. (10) is the index related to the cause $x^- \in \mathcal{X}^m$ of Y. The Markov property immediately implies that in both cases we can use the same cause vector, indicated by, say, $\hat{i}$, as long as m ≥ m′.
Theorem 2 (Product rule for a chain). Let A and B be the causal tensors of two causal channels in series and let the tensor C represent the resulting indirect causal channel that must be measured in a bivariate approach. The tensor elements of C are given by
$$C^k_{h\hat{i}} = p^g_{h\hat{i}}\, A^{\hat{j}}_{g\hat{i}}\, B^k_{h\hat{j}}. \tag{13}$$
For readability we moved the proof to the appendix. If both A and B represent DMC's we get the simpler, well known product rule for a chain of DMC's.
Corollary 2 (Product rule for a chain consisting of DMC's). Let A and B be the causal tensors of two DMC's in series and let the tensor C represent the resulting, indirect, causal channel that must be measured in a bivariate approach. The tensor elements of C are given by
$$C^k_i = A^j_i B^k_j.$$
Using this corollary leads to a very specific interpretation of Eq. (13). First define
$$\bar{A}^{\hat{j}}_{h\hat{i}} := p^g_{h\hat{i}}\, A^{\hat{j}}_{g\hat{i}}. \tag{14}$$
We use the notation in terms of the tensor A and the (¯) operation because it is indicative of the origin of these conditional probabilities. We can now rewrite Eq. (13) as
$$C^k_{h\hat{i}} = \bar{A}^{\hat{j}}_{h\hat{i}}\, B^k_{h\hat{j}}. \tag{15}$$
The causal tensor $\bar{A}^{\hat{j}}_{h\hat{i}}$ is a trivariate tensor: index $\hat{i}$ is associated with random variable X, index $\hat{j}$ is associated with random variable Y, and index h is associated with random variable Z. According to corollary 2, this equation can be interpreted as representing two DMC's in series. This means that we have an alternative structure for two causal channels in series, as depicted in Figure 1c. Because the Data Processing Inequality is applicable to a cascade of DMC's, the alternative structure suggests that there is a DPI for Transfer Entropy.
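The product rule for two DMC's in series can be illustrated by simulation: sample a chain X → Y → Z, estimate the transition matrix X → Z from the samples, as one would in a bivariate measurement, and compare it with the matrix product of the two direct channels. This is a hedged sketch; the helper names `sample` and `estimate_channel`, the transition matrices, the input pmf and the sample size are all assumptions made for the illustration.

```python
import random

def sample(pmf, rng):
    """Draw an index from a pmf given as a list of probabilities."""
    return rng.choices(range(len(pmf)), weights=pmf)[0]

def estimate_channel(inputs, outputs, n_in, n_out):
    """Row-stochastic matrix of relative frequencies p(out | in)."""
    counts = [[0] * n_out for _ in range(n_in)]
    for i, k in zip(inputs, outputs):
        counts[i][k] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]

rng = random.Random(1)
a = [[0.9, 0.1], [0.3, 0.7]]    # hypothetical DMC for X -> Y
b = [[0.8, 0.2], [0.25, 0.75]]  # hypothetical DMC for Y -> Z

xs = [sample([0.4, 0.6], rng) for _ in range(100_000)]
ys = [sample(a[x], rng) for x in xs]
zs = [sample(b[y], rng) for y in ys]

# Bivariate "measurement" of X -> Z versus the product of the direct channels.
c_measured = estimate_channel(xs, zs, 2, 2)
c_product = [[sum(a[i][j] * b[j][k] for j in range(2)) for k in range(2)]
             for i in range(2)]
max_gap = max(abs(c_measured[i][k] - c_product[i][k])
              for i in range(2) for k in range(2))
print(c_measured, c_product, max_gap)
```

The measured and the computed matrices should agree up to sampling noise, which is what allows the relation X → Z to be recognized as indirect in this memoryless setting.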

B. The fork structure
Assume that the fork is the ground truth. The goal is to express the indirect association represented by B in terms of the other causal tensors. First of all we notice that the input distribution can be reconstructed from the output distribution.
Definition 3 (Reconstruction). The ‡-operation, or reconstruction operation, reconstructs the source distribution, conditioned on the past of the destination, from the destination distribution, conditioned on the past of the destination:
$$p^{\hat{i}}_g = p^j_g\, A^{\ddagger\,\hat{i}}_{gj},$$
with $A^{\ddagger\,\hat{i}}_{gj} = p^{\hat{i}}_{gj}$.
The ‡-operation applied to the directed graph X → Y results in the graph X ←‡ Y. This implies that in the framework of causal tensors, a fork has equivalent chains.
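For a single sub-channel the reconstruction operation amounts to Bayes' rule. The sketch below illustrates this for one DMC; it ignores the conditioning on the past of the destination, and the function names `output_pmf` and `reconstruct` as well as the numbers are assumptions for illustration only.

```python
def output_pmf(p_in, channel):
    """Push a pmf through a row-stochastic matrix channel[i][j] = p(out_j | in_i)."""
    return [sum(p_in[i] * channel[i][j] for i in range(len(p_in)))
            for j in range(len(channel[0]))]

def reconstruct(p_in, channel):
    """Reverse matrix R[j][i] = p(in_i | out_j), obtained with Bayes' rule."""
    p_out = output_pmf(p_in, channel)
    return [[p_in[i] * channel[i][j] / p_out[j] if p_out[j] > 0 else 0.0
             for i in range(len(p_in))]
            for j in range(len(p_out))]

# Hypothetical source pmf and channel X -> Y.
x = [0.25, 0.75]
a = [[0.8, 0.2], [0.4, 0.6]]

y = output_pmf(x, a)          # forward transformation
a_rec = reconstruct(x, a)     # reverse channel from Y back to X
x_back = output_pmf(y, a_rec) # should recover the source pmf
print(y, a_rec, x_back)
```

Note that the reverse matrix depends on the source pmf as well as on the forward matrix, so it has to be recomputed when the input distribution changes.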
The indirect association represented by B in terms of the other two tensors of the chain follows directly from the product rule for a chain (theorem 2).
In a bivariate measurement we will always be able to determine the ground truth correctly in the case of the v-structure depicted in Figure 2c. However, investigating structures with a collider, the v-structure and the more general directed triangle, will result in the important concept of interaction. So, let us assume that the ground truth is the directed triangle. We now have to introduce the multivariate relation D : {X, Y} → Z. This relation leads to the additional linear transformation
$$p^k_h = p^{\hat{i}\hat{j}}_h\, D^k_{h\hat{i}\hat{j}}. \tag{18}$$
We call the tensor D the interaction tensor. The tensors B and C can be expressed in terms of the tensor D.
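Lemma 1 (Causal Tensor Contraction). In the case of a directed triangle we can express the causal tensors in terms of the interaction tensor:
$$B^k_{h\hat{j}} = p^{\hat{i}}_{h\hat{j}}\, D^k_{h\hat{i}\hat{j}}, \tag{19a}$$
$$C^k_{h\hat{i}} = p^{\hat{j}}_{h\hat{i}}\, D^k_{h\hat{i}\hat{j}}. \tag{19b}$$
We will only derive this relation for Eq. (19a).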
Sketch of Proof. First we note that $p^{\hat{i}\hat{j}}_h = \delta^{\hat{j}'}_{\hat{j}}\, p^{\hat{j}}_h\, p^{\hat{i}}_{h\hat{j}'}$. With this Eq. (18) is rewritten as: By changing the order of δ

From Eq. (19) it follows that B and C are the result of a cascade involving A, A‡ and D. The graphs represented by Figures 2e and 2f support the tensor relations, e.g., X → {X, Y} → Z is equivalent to the cascade of the inverse multiplexers represented by A and D resulting in C. Figures 2c and 2d, however, do not support the calculation rules for causal tensors.
Proposition 2. If a complex system contains v-structures, the causal graph must be represented by a directed hypergraph [17]. In a hypergraph an edge connects any number of vertices. The interaction tensor corresponds to a so-called hyperedge.
The interaction tensor describes the interaction of inputs at the v-structure.If one of the relations is indirect, no interaction takes place.
Theorem 4 (Indirect causes do not contribute to an interaction). The interaction tensor only depends on the direct causes, not on indirect causes. So,
$$D^k_{h\hat{i}\hat{j}} = B^k_{h\hat{j}}$$
if and only if the chain is the ground truth, and
$$D^k_{h\hat{i}\hat{j}} = C^k_{h\hat{i}}$$
if and only if the fork is the ground truth.
For the proof we use the fact that the elements of a causal tensor are conditional probabilities. Again, due to the fork-chain equivalence, we only need to prove it for a chain.

Sketch of Proof.
Let the ground truth be the chain. In that case X ⊥⊥ Z|Y and X is a non-effect of Z. The index $\hat{i}$ is associated with X, the index $\hat{j}$ is associated with Y, and the indices h and k are associated with Z. The Causal Markov Condition leads to
$$D^k_{h\hat{i}\hat{j}} = p(\zeta_k | \zeta^-_h, \chi^-_{\hat{i}}, \psi^-_{\hat{j}}) = p(\zeta_k | \zeta^-_h, \psi^-_{\hat{j}}) = B^k_{h\hat{j}}.$$
An immediate consequence of this theorem is that in general a fork, a chain and a directed triangle can be distinguished. The conditions under which this is not possible will be derived later.

Corollary 3. If and only if the chain is the ground truth,
$$C = \bar{A} \bullet B.$$
If and only if the fork is the ground truth,
$$B = \bar{A}^{\ddagger} \bullet C.$$
In the case of a directed triangle, neither Eq. (13) nor Eq. (17) is valid.
We will only prove this in the case of a chain.

Sketch of Proof.
If the ground truth is a chain, the ground truth is not a fork. According to theorem 4, $D^k_{h\hat{i}\hat{j}} = B^k_{h\hat{j}}$. Combining this with Lemma 1 results in $C^k_{h\hat{i}} = p^{\hat{j}}_{h\hat{i}}\, B^k_{h\hat{j}}$.
In the following two examples we will illustrate that indirect associations do not interact. Without loss of generality we assume that the causal tensors represent DMC's.
Assume that $x = (\frac{2}{5}, \frac{3}{5})$. The pmf for ȳ equals ȳ = xA ⇒ $\bar{y} = (\frac{4}{5}, \frac{1}{5})$. Using the relation x = ȳA‡ the reader can verify that Eq. (22) is indeed valid.
In the following example it is assumed that the fork is the ground truth. Eq. (23) is valid.

D. Toward a causal tensor algebra
The calculation rules for causal tensors follow from probability theory [11], Pearl's theory of causality [1] and linear algebra. From the examples and derivations thus far we have seen that the operations on and with causal tensors follow very specific rules. These rules can be used to simplify the notation even more.
The earlier introduced row stochastic causal tensors A, B and C are used with their respective indices. The stochastic row vectors, i.e., pmfs, are defined as x, ȳ and z so that: xA = ȳ, ȳB = z and xC = z. Furthermore the notation {•} is used to indicate the elements of a tensor.
Definition 4. The channel averaging operator (¯) applied to a causal tensor A is defined as
$$\{\bar{A}\}^{\hat{j}}_{h\hat{i}} = p^g_{h\hat{i}}\, \{A\}^{\hat{j}}_{g\hat{i}},$$
with $p^g_{h\hat{i}}$ a trivariate row stochastic tensor. The averaging operator plays a role in cascades.
Definition 5. The causal tensor cascading operator ⊙ applied to two causal tensors A and B is defined as a tensor contraction. The number of unique indices of the resulting tensor is always less than the total number of unique indices of the constituting tensors.
Because we use row stochastic tensors, a cascade is read from left to right, e.g., xA ⊙ B is the transformation of the pmf x via the operator A. The resulting output pmf is then transformed via the operator B.
Definition 6. The reconstruction operation ‡ reconstructs the input pmf from the output pmf: x = ȳA‡.
Definition 7. The identity causal tensor I is defined as
These definitions lead to the following properties (the proofs are straightforward and therefore omitted):

In this section we discuss some of the non-trivial implications when using causal tensors to infer the causal structure from time series data. First we will show that a Data Processing Inequality for Transfer Entropy exists. Because we did not make any assumption about the cardinality of the alphabets used, this DPI is also valid for time-discrete continuous data.
We then prove that we can differentiate between a fork, a chain and a directed triangle as long as the data are noisy, but not "perfectly noisy" (this will be defined later in this paper).

A. The Data Processing Inequality for TE
The DPI for TE gives a sufficient condition to assess if a relation is a proper direct relation. It gives a necessary condition to detect potential indirect relations.
Theorem 5 (Data processing inequality for TE). For the chain X → Y → Z the following inequality holds:
$$TE_{X \to Z} \le \min[TE_{X \to Y},\, TE_{Y \to Z}].$$
For the proof a simplified notation for Transfer Entropy and mutual information is used. We write these measures as a function of pmfs, indicated by (•), and the respective tensor: $TE(A, \bullet)$ and $I(A_h, \bullet)$. The subscript h indicates the h-th sub-channel representing a DMC.
Sketch of Proof. From Eq. (15), $C_h = \bar{A}_h \bullet B_h$, it follows that for a chain the DPI is valid per sub-channel. As per Eq. (9), multiplying both sides by $p(\zeta^-_h)$, i.e., the probability that the h-th sub-channel is selected, and summing over h, results in a DPI for Transfer Entropy:
$$TE(C, \bullet) \le \min[TE(\bar{A}, \bullet),\, TE(B, \bullet)]. \tag{25}$$
The tensor $\bar{A}_h$ is itself the result of two cascaded channels represented by $A_g$ and a tensor with elements $p^g_{\hat{i}h}$. For these two DMC's the DPI is also valid, leading to $I(\bar{A}_h, \bullet) \le I(A_g, \bullet)$. We now multiply both sides of this equation by $p(\zeta^-_h)\,p(\psi^-_g)$, and sum over h and g, resulting in $TE(\bar{A}, \bullet) \le TE(A, \bullet)$. We can now rewrite Eq. (25) as $TE(C, \bullet) \le \min[TE(A, \bullet),\, TE(B, \bullet)]$, or, equivalently, $TE_{X \to Z} \le \min[TE_{X \to Y},\, TE_{Y \to Z}]$. A similar DPI also exists for the channel capacity for causal tensors (see Eq. (6)).

B. Differentiating between direct and indirect associations with causal tensors
We have shown earlier that in general a fork, a chain and a directed triangle are distinguishable (see corollary 3). We now investigate in more detail under what conditions this is not possible.
Definition 8 (Perfect noisy relation). Iff all causal tensor elements are equal, the relation is a perfect noisy relation. The related causal tensor is called the perfect noisy causal tensor.
The behavior of a perfect noisy causal tensor is straightforward and therefore left to the reader to confirm: (1) any input pmf is transformed into a uniform probability distribution, (2) the channel capacity equals 0. The opposite of the perfect noisy causal tensor is the noiseless causal tensor.
Definition 9 (Noiseless causal tensor). The elements of a noiseless causal tensor satisfy
The reader can verify by using Eq. (11) that for any input pmf $TE = \log_2\left(\sum_{\hat{j}} 1\right)$. Because the channel capacity of a noiseless channel only depends on the number of alphabet elements, $C_{XY} = \min[\log_2(|\mathcal{X}^m|), \log_2(|\mathcal{Y}^\ell|)]$ [10], our definition is indeed a noiseless causal channel. An immediate consequence of the definition of a noiseless tensor is that the cardinality of the input pmf equals the cardinality of the output pmf.
Theorem 6 (Perfectness). We are not able to differentiate between direct and indirect relations if: (1) all relations are perfectly noiseless, or (2) the relations are perfectly noisy.
Sketch of Proof. If both B = Ā‡ • C and C = Ā • B are valid, causal tensors cannot distinguish a fork from a chain. There are two cases that need to be considered. In the first case conditions are derived using the causal tensor relations. In the second case we show that the pmfs impose a certain condition. The second case in which a fork and a chain cannot be distinguished follows from the pmf transformations:

We start by combining

The outputs of both the left hand side and the right hand side of these equations are probability mass functions. If they are indistinguishable, we cannot differentiate between a fork and a chain either. Assume that both B and C are perfect noisy causal tensors. With u(y) and u(x) representing the respective uniform pmfs, Eq. (29) reduces to

In [18] an example is given of two perfect noisy relations that interacted, resulting in a noiseless transmission. In other words, perfect noisy causal tensors can interact in such a way that the resulting interaction tensor is noiseless. On the other hand, perfect noiseless relations imply maximal redundancy within a data set.

C. Causal inference steps
To finalize the causal tensor framework as discussed so far, a short summary of the (implicitly) proposed steps is given. We assume that: (1) the data are equidistant in time, (2) ℓ and m are determined correctly, and (3) the data are ergodic and stationary.
1. Encode the data into a finite alphabet.
2. Determine the (bivariate) causal tensors for a range of interaction delays.
3. Determine the optimal interaction delay.
4. Determine per relation the direction of causation.
5. Identify the potential indirect relations using the DPI.
6. Use the product rule to determine if the indirect relations are indeed indirect.
7. Determine the interaction tensor for perfect noisy relations that collide.
8. If the network is used for simulation, determine the interaction tensors for all v-structures.
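Steps 2 to 4 of the list above can be sketched as a scan over interaction delays. The code below is only an illustration under explicit assumptions: it scores each delay with a plug-in Transfer Entropy estimate (the experiments in this paper maximize the channel capacity instead), it uses ℓ = m = 1, it only considers delays of at least one sample, and the names `delayed_te` and `best_delay` and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from math import log2

def delayed_te(source, target, delay):
    """Plug-in estimate in bits of I(y_t ; x_{t-delay} | y_{t-1}) for delay >= 1."""
    samples = [(target[t - 1], source[t - delay], target[t])
               for t in range(delay, len(target))]
    n = len(samples)
    c_gij = Counter(samples)
    c_gi = Counter((g, i) for g, i, _ in samples)
    c_gj = Counter((g, j) for g, _, j in samples)
    c_g = Counter(g for g, _, _ in samples)
    # p(j | g, i) / p(j | g) expressed with counts.
    return sum(c / n * log2(c * c_g[g] / (c_gi[g, i] * c_gj[g, j]))
               for (g, i, j), c in c_gij.items())

def best_delay(source, target, delays):
    """Delay with the largest score; ties resolve to the smallest delay."""
    return max(delays, key=lambda d: delayed_te(source, target, d))

# Synthetic data: the target copies the source with a delay of 3 samples.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]
y = [0, 0, 0] + x[:-3]

delay_xy = best_delay(x, y, range(1, 8))
te_forward = delayed_te(x, y, delay_xy)
te_backward = max(delayed_te(y, x, d) for d in range(1, 8))
print(delay_xy, te_forward, te_backward)
```

Comparing the maximal score of the two hypotheses then gives the direction of the relation; here the forward direction should dominate clearly.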

VII. EXPERIMENT
We finalize this paper with two experiments to illustrate that nonlinear behavior is indeed captured with causal tensors.

A. Ulam map
For the first experiment we use the one-dimensional lattice of unidirectionally coupled maps $x^m_{n+1} = f(\epsilon x^{m-1}_n + (1 - \epsilon) x^m_n)$ with the Ulam map $f(x) = 2 - x^2$. There are two regions (ε ≈ 0.18, ε ≈ 0.82) where no information is shared between maps [4]. We chose an alphabet consisting of 4 symbols. The quantization consisted of simple binning. Furthermore we chose ℓ = m = 1 (see Eq. (7)). Instead of maximizing TE we maximized the channel capacity to determine the optimal delay. An approximation that satisfies the bounds that follow from Eq. (9) was used. To determine the channel capacities the Blahut-Arimoto algorithm was used [19]. The delays were varied between 1 and 20. The channel capacity was maximal for a delay of 1 sample. As can be seen from Figure 3, causal tensors lead to a similar result as Transfer Entropy.
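The Blahut-Arimoto iteration mentioned above can be sketched for a single DMC as follows. This is the textbook form of the algorithm for one transition matrix, not the approximation for causal channels used in the experiment; the function name `blahut_arimoto`, the tolerance and the example channels are assumptions.

```python
from math import exp, log

def blahut_arimoto(channel, tol=1e-9, max_iter=10_000):
    """Capacity in bits of a DMC with row-stochastic matrix channel[i][j] = p(y_j | x_i).

    Returns the capacity estimate and the input pmf reached by the iteration.
    """
    n_in, n_out = len(channel), len(channel[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input pmf
    lower = 0.0
    for _ in range(max_iter):
        q = [sum(p[i] * channel[i][j] for i in range(n_in)) for j in range(n_out)]
        # c[i] = exp of the divergence between row i and the current output pmf.
        c = [exp(sum(w * log(w / q[j]) for j, w in enumerate(row) if w > 0))
             for row in channel]
        z = sum(p_i * c_i for p_i, c_i in zip(p, c))
        lower, upper = log(z), log(max(c))  # bounds on the capacity in nats
        p = [p_i * c_i / z for p_i, c_i in zip(p, c)]
        if upper - lower < tol:
            break
    return lower / log(2), p

# Hypothetical channels: binary symmetric, noiseless, and with all elements equal.
cap_bsc, p_bsc = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
cap_noiseless, _ = blahut_arimoto([[1.0, 0.0], [0.0, 1.0]])
cap_noisy, _ = blahut_arimoto([[0.5, 0.5], [0.5, 0.5]])
print(cap_bsc, cap_noiseless, cap_noisy)
```

The last two calls correspond to the two extremes discussed in this paper: a noiseless binary channel reaches 1 bit and a channel with all elements equal has capacity 0.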

B. Coupled Ornstein-Uhlenbeck processes
In the second experiment we demonstrate our approach using a system of four coupled Ornstein-Uhlenbeck processes [3]:
$$\dot{w}(t) = -0.8\,w(t) - 0.4\,y(t-3)^2 + 0.05\,y(t-3) + \eta_w(t), \tag{32}$$
with independent unit variance white noise processes η. The integration time step was dt = 0.01 s and the sampling interval ∆s = 100 s. A binary encoding scheme was used. First the data were normalized, after which they were partitioned at 0.5. Because the Shannon entropy of the encoded data was close to 1, we expect highly noisy communication channels. The disadvantage of binary encoding is that more data are needed to capture the transmitted information. On the other hand, cascading very noisy channels reduces the probability of detecting an indirect relation. This is illustrated in Figure 4; no pruning was needed. This experiment shows that causal tensors are indeed capable of detecting the underlying structure.
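The binary encoding step can be sketched as follows. The text does not state which normalization was used, so min-max scaling to the interval [0, 1] is an assumption here, as are the function names `binary_encode` and `shannon_entropy` and the toy series.

```python
from collections import Counter
from math import log2

def binary_encode(series, threshold=0.5):
    """Min-max normalize a series to [0, 1] (assumed) and partition it at `threshold`."""
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        return [0] * len(series)  # a constant series carries a single symbol
    return [1 if (v - lo) / span >= threshold else 0 for v in series]

def shannon_entropy(symbols):
    """Plug-in Shannon entropy in bits of a symbol sequence."""
    n = len(symbols)
    return sum(-(c / n) * log2(c / n) for c in Counter(symbols).values())

# Hypothetical toy series.
encoded = binary_encode([0.0, 1.0, 2.0, 3.0])
print(encoded, shannon_entropy(encoded))
```

An entropy close to 1 bit for the encoded series means that both symbols occur about equally often, which is the situation reported for the encoded data in this experiment.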

VIII. CONCLUSION
To conclude, we used Transfer Entropy to arrive at a tensor formalism with which causal structures can be inferred. Theorems were established that allow us to differentiate between direct and indirect associations, and we showed the importance of noise within this formalism. Using this formalism, a Data Processing Inequality was proved to exist for TE. Finally, the formalism allows for simulating the behavior of the inferred system because an edge is represented by a tensor instead of a scalar.

Appendix: Proof of product rule
Theorem 2 (Product rule for a chain). Let A and B be the causal tensors of two causal channels in series and let the tensor C represent the resulting indirect causal channel that is measured in a bivariate approach. The tensor elements of C are given by

$$C^{k}_{h\hat{i}} = p^{g}_{h\hat{i}}\, A^{\hat{j}}_{g\hat{i}}\, B^{k}_{h\hat{j}}.$$

For the proof we need to introduce two lemmas.

Sketch of Proof. Another direct consequence of the Markov property is related to indices associated with the same random variable. As long as the index related to the past of the output, e.g., g, and the index related to the output, e.g., j, appear in the same tensor, we are allowed to replace the output index by the input index. In our example this means we are allowed to replace j by ĵ as long as we ensure that $\psi^{-}_{\hat{j}} = \{\psi_{j}, \psi^{-}_{g}\}$. This is always possible due to the Markov property: we enlarge the cardinality of either $\psi^{-}_{\hat{j}}$ or $\psi^{-}_{g}$.

Lemma 3. For the chain X → Y → Z we have $A^{\hat{j}}_{\hat{i}gh} = A^{\hat{j}}_{\hat{i}g}$.

For the proof we refer to Example 1.

Sketch of Proof. Because of the Law of Total Probability we are allowed to condition Eq. (10) on h and both Eq. (12a) and Eq. (12b) on g. This leads to

FIG. 1. (a) Causal channel. (b) Two causal channels in series representing the communication model related to Transfer Entropy for the cascade X → Y → Z. (c) The equivalent causal channel for two causal channels in series.

FIG. 2. The basic directed graph structures: (a) the chain, (b) the fork, (c) the v-structure, and (d) the directed triangle. The graphs (e) and (f) reflect the calculation rules for the causal tensors for the v-structure and the directed triangle, respectively.

Example 3.
Let the chain X → Y → Z be the ground truth. With A = … the association is represented by the causal tensor C = A · B.

Example 4.
Let the fork X → Y + X → Z be the ground truth. Assume that A = … The pmf for ȳ equals ȳ = xA ⇒ ȳ = (4/5, 1/5). As in the previous example, A‡ = … The association is represented by the causal tensor B = A‡ · C ⇒ B = …

These equations are valid when I₁ = Ā‡ · Ā and I₂ = Ā · Ā‡, with I₁ and I₂ identity causal tensors. By definition, identity tensors are noiseless. Because the causal tensors are stochastic tensors, their elements are nonnegative. The product of two stochastic tensors can equal a noiseless tensor if and only if both Ā and Ā‡ are noiseless. Along the same line of reasoning we finally arrive at the conclusion that A and A‡ are noiseless causal tensors, because the averaging operation is in fact a matrix multiplication of two tensors.
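The nonnegativity argument can be checked numerically in the simplest setting. The sketch below treats the tensors as ordinary square row-stochastic matrices, which is a simplifying assumption, and uses hypothetical example matrices. The only candidate for a two-sided inverse is the matrix inverse, and for a noisy channel that inverse contains negative entries, so it cannot itself be a stochastic matrix.

```python
import numpy as np

def is_row_stochastic(M, atol=1e-9):
    """True if all entries are nonnegative and every row sums to 1."""
    M = np.asarray(M, dtype=float)
    return bool(np.all(M >= -atol) and np.allclose(M.sum(axis=1), 1.0, atol=atol))

# Hypothetical noisy channel: its inverse exists but has negative entries
noisy = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
noisy_inv = np.linalg.inv(noisy)

# Hypothetical noiseless channel (a permutation): its inverse is again a permutation
perm = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
perm_inv = np.linalg.inv(perm)

print(is_row_stochastic(noisy_inv))  # False: no stochastic inverse exists
print(is_row_stochastic(perm_inv))   # True: the noiseless case is invertible
```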

FIG. 4. (a) The causal structure for the Ornstein-Uhlenbeck system of Eq. (32). The other graphs show the inferred causal structures at different time series lengths. The confidence interval was 90% and the maximum delay was set to 20 s: (b) T = 10 ks, (c) T = 100 ks. In (d), T = 500 ks, the interaction delays that maximized the channel capacity are also shown.
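$$p^{\hat{j}}_{gh} = p^{\hat{i}}_{gh}\, A^{\hat{j}}_{gh\hat{i}}, \tag{A.1a}$$
$$p^{k}_{gh} = p^{\hat{j}}_{gh}\, B^{k}_{gh\hat{j}}, \tag{A.1b}$$
$$p^{k}_{gh} = p^{\hat{i}}_{gh}\, C^{k}_{gh\hat{i}}. \tag{A.1c}$$

Substituting the expression for $p^{\hat{j}}_{gh}$ of Eq. (A.1a) in Eq. (A.1b) and combining the result with Eq. (A.1c) gives us

$$C^{k}_{gh\hat{i}} = A^{\hat{j}}_{gh\hat{i}}\, B^{k}_{gh\hat{j}}. \tag{A.2}$$

Using Lemma 2 and Lemma 3, this can be rewritten as

$$C^{k}_{gh\hat{i}} = A^{\hat{j}}_{g\hat{i}}\, B^{k}_{h\hat{j}}. \tag{A.3}$$

Finally, we multiply both sides with $p^{g}_{h\hat{i}}$. As the reader can confirm, the term $p^{g}_{h\hat{i}}\, C^{k}_{gh\hat{i}}$ equals $C^{k}_{h\hat{i}}$, leading to Eq. (2).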
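A numerical sketch of the product rule in its simplest form may help. It assumes memoryless channels, so the conditioning indices g and h are dropped and each causal tensor reduces to a row-stochastic matrix. The matrices `A` and `B`, the input pmf `p_x` and the helper `mutual_information_bits` are hypothetical and are not taken from the examples in the paper.

```python
import numpy as np

def mutual_information_bits(p_in, W):
    """Mutual information (bits) between input and output of channel W for input pmf p_in."""
    joint = p_in[:, None] * W                    # joint pmf p(input, output)
    indep = p_in[:, None] * joint.sum(axis=0)    # product of the two marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

# Hypothetical channels for the chain X -> Y -> Z
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])    # rows: p(y | x)
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])    # rows: p(z | y)
C = A @ B                     # cascade: p(z | x) = sum_y p(y | x) p(z | y)

p_x = np.array([0.5, 0.5])
p_y = p_x @ A
i_xy = mutual_information_bits(p_x, A)
i_yz = mutual_information_bits(p_y, B)
i_xz = mutual_information_bits(p_x, C)
```

In this memoryless setting the cascade can carry no more information than either link, so `i_xz` does not exceed `i_xy` or `i_yz`. This is the classical Data Processing Inequality, not the TE version proved in the paper.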
The number of elements in the alphabet, i.e., the cardinality, is denoted as |X|, |Y|, and |Z|, respectively.

TABLE I. Overview of indices used throughout the paper.
Equation (17a) is applicable in the case depicted in Figure 2d, i.e., a bivariate measurement between Y and Z results in an indirect association with Y as the cause and Z as the effect. The equivalent chain is Y →‡ X → Z. In the case that Z is the cause and Y the effect of the indirect association, Eq. (17b) is applicable. The equivalent chain in that case is Y ← X ←‡ Z.
FIG. 3. Transfer Entropy and the channel capacity of the causal tensor for two unidirectionally coupled Ulam maps X₁ and X₂ as a function of the coupling strength ε. Only the relation X₁ → X₂ is shown. Dots: approximated channel capacity for the causal channel. Line: Transfer Entropy as determined by Schreiber.