Multivariate Dependence Beyond Shannon Information

Accurately determining dependency structure is critical to discovering a system's causal organization. We recently showed that the transfer entropy fails in a key aspect of this (measuring information flow) due to its conflation of dyadic and polyadic relationships. We extend this observation to demonstrate that this is true of all such Shannon information measures when used to analyze multivariate dependencies. This has broad implications, particularly when employing information to express the organization and mechanisms embedded in complex systems, including the burgeoning efforts to combine complex network theory with information theory. Here, we do not suggest that any aspect of information theory is wrong. Rather, the vast majority of its informational measures are simply inadequate for determining the meaningful dependency structure within joint probability distributions. Therefore, such information measures are inadequate for discovering intrinsic causal relations. We close by demonstrating that such distributions exist across an arbitrary set of variables.

The past two decades, however, produced a small but important body of results detailing how standard Shannon information measures are unsatisfactory for determining some aspects of dependency and shared information. Within information-theoretic cryptography, the conditional mutual information has proven to be a poor bound on secret key agreement [26,27]. The conditional mutual information has also been shown to be unable to accurately measure information flow [28, and references therein]. Finally, the inability of standard methods of decomposing the joint entropy to provide any semantic understanding of how information is shared has motivated entirely new methods of decomposing information [29,30]. Common to all these is the fact that conditional mutual information conflates intrinsic dependence with conditional dependence.
Here, we demonstrate a related, but deeper issue: Shannon information measures (entropy, mutual information, and their conditional and multivariate versions) can fail to distinguish joint distributions with vastly differing internal dependencies.
Concretely, we start by constructing two joint distributions, one with dyadic subdependencies and the other with strictly triadic subdependencies. From there, we demonstrate that no standard Shannon-like information measure, and exceedingly few nonstandard methods, can distinguish the two. Stated plainly: when viewed through Shannon's lens, these two distributions are erroneously equivalent. While distinguishing these two (and their internal causality) may not be relevant to a mathematical theory of communication, it is absolutely critical to a mathematical theory of information storage, transfer, and modification [31]. We then demonstrate two ways in which these failures generalize to the multivariate case. The first generalizes our two distributions to the multivariate and polyadic case via "dyadic camouflage". The second details a method of embedding an arbitrary distribution into a larger variable space using hierarchical dependencies, a technique we term "dependency diffusion". In this way, one sees that the initial concerns about information measures can arise in virtually any statistical multivariate analysis.
In this short development, we assume a working knowledge of information theory, such as found in standard textbooks [32-35].
TABLE I. The (a) dyadic and (b) triadic probability distributions over the three random variables X, Y, and Z that take values in the four-letter alphabet {0, 1, 2, 3}. Though not directly apparent from their tables of joint probabilities, the dyadic distribution is built from dyadic (pairwise) subdependencies while the triadic from triadic (three-way) subdependencies.

DEVELOPMENT
We begin by considering the two joint distributions shown in Table I. The first represents dyadic relationships among three random variables X, Y, and Z. The second represents triadic relationships among them. These appellations are used for reasons that will soon be apparent. How are these distributions structured? Are they structured identically, or are they qualitatively distinct?
We can develop a direct picture of underlying dependency structure by casting the random variables' four-symbol alphabet used in Table I into composite binary random variables, as displayed in Table II. It can be readily verified that the dyadic distribution follows three simple dyadic rules. The triadic distribution similarly follows simple rules: $X_0 + Y_0 + Z_0 = 0 \bmod 2$ (the xor relation [37], or, equivalently, any one of them is the xor of the other two), and $X_1 = Y_1 = Z_1$. These are two triadic rules. These dependency structures are represented pictorially in Fig. 1. Our development from this point on will not use any knowledge of this structure, but rather it will attempt to determine the structure using only information measures.
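The binary-subvariable picture can be made concrete with a short constructive sketch. The triadic rules below are the ones stated in the text. The particular dyadic pairing (X_0 = Y_1, Y_0 = Z_1, Z_0 = X_1) and the symbol encoding 2·(first bit) + (second bit) are assumptions made for illustration, and the function names are hypothetical.

```python
from itertools import product

def build(rule):
    """Enumerate the equally likely (X, Y, Z) outcomes whose bits satisfy `rule`."""
    outcomes = []
    for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6):
        if rule(x0, x1, y0, y1, z0, z1):
            # Assumed encoding: symbol = 2 * (first bit) + (second bit).
            outcomes.append((2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1))
    return sorted(outcomes)

def dyadic_rule(x0, x1, y0, y1, z0, z1):
    # Assumed pairing of subvariables: three pairwise equalities.
    return x0 == y1 and y0 == z1 and z0 == x1

def triadic_rule(x0, x1, y0, y1, z0, z1):
    # Rules from the text: an xor constraint plus one bit shared by all three.
    return (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1

dyadic = build(dyadic_rule)
triadic = build(triadic_rule)

for name, dist in (("dyadic", dyadic), ("triadic", triadic)):
    print(f"{name}: {len(dist)} equally likely outcomes")
    for outcome in dist:
        print("  ", outcome)
```

Under these assumptions each distribution has eight equally likely outcomes over the alphabet {0, 1, 2, 3}, yet the two outcome sets differ.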
HC, Tabu, MMHC, and RSMAX2 methods of the bnlearn [39] package, version 4.0. Each of these methods, when run with its default parameters, resulted in the Bayesian network depicted in Fig. 2: three lone nodes. Despite the popularity of Bayesian networks for modeling dependencies, the failure of these methods is not surprising: the dyadic and triadic distributions violate a basic premise of Bayesian networks, since their dependency structure cannot be represented by a directed acyclic graph. This also implies that the methods of Pearl [40] and their generalizations [41] give no insight into structure in these distributions. And so, we leave Bayesian network inference aside.
What does an informational analysis say? Both the dyadic and triadic distributions describe events over three variables, each with an alphabet size of four. Each consists of eight joint events, each with a probability of 1/8. As such, each has a joint entropy of $H[X, Y, Z] = 3$ bit. Our first observation is that any entropy, conditional or not, and any mutual information, conditional or not, will be identical for the two distributions. Specifically, the entropy of any variable conditioned on the other two vanishes: $H[X \mid Y, Z] = H[Y \mid X, Z] = H[Z \mid X, Y] = 0$ bit.
FIG. 2. Graphical depiction of the result from applying several Bayesian network inference algorithms to both the dyadic and triadic distributions. The algorithms do not find any dependence between the variables X, Y, and Z, resulting in three disconnected nodes. The algorithms' failure is not surprising: the dyadic and triadic distributions cannot be represented by a directed acyclic graph, a basic assumption of Bayesian network inference.
As a brief aside, it is of interest to note that it has been suggested (e.g., in Refs. [44,45], among others) that zero co-information implies that at least one variable is independent of the others; that is, in this case, a lack of three-way interactions. Krippendorff [46] early on demonstrated that this is not the case, though these examples more clearly exemplify this fact.
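The entropy claims can be checked numerically. The sketch below rebuilds both distributions from binary rules (the dyadic pairing and the symbol encoding are assumptions, not taken from the text) and computes the entropy of every subset of variables with the standard library only. All names are illustrative.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

def build(rule):
    # Assumed encoding: symbol = 2 * (first bit) + (second bit).
    return [(2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
            for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6)
            if rule(x0, x1, y0, y1, z0, z1)]

# Assumed dyadic pairing; the triadic rules are the ones given in the text.
dyadic = build(lambda x0, x1, y0, y1, z0, z1: x0 == y1 and y0 == z1 and z0 == x1)
triadic = build(lambda x0, x1, y0, y1, z0, z1:
                (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1)

def entropy(outcomes, idx):
    """Shannon entropy in bits of the marginal over positions `idx`."""
    counts = Counter(tuple(o[i] for i in idx) for o in outcomes)
    n = len(outcomes)  # outcomes are equally likely
    return -sum(c / n * log2(c / n) for c in counts.values())

def subset_entropies(outcomes):
    names = "XYZ"
    return {"".join(names[i] for i in idx): entropy(outcomes, idx)
            for r in (1, 2, 3) for idx in combinations(range(3), r)}

h_dyadic = subset_entropies(dyadic)
h_triadic = subset_entropies(triadic)
for subset in h_dyadic:
    print(f"H[{subset}]: dyadic={h_dyadic[subset]:.1f} triadic={h_triadic[subset]:.1f}")

# H[X | Y, Z] = H[X, Y, Z] - H[Y, Z], and similarly for the other variables.
cond_dyadic = h_dyadic["XYZ"] - h_dyadic["YZ"]
cond_triadic = h_triadic["XYZ"] - h_triadic["YZ"]
print("H[X|Y,Z]:", cond_dyadic, cond_triadic)
```

Because every conditional entropy and mutual information is a linear combination of these subset entropies, agreement on all seven values implies agreement on all such quantities.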
We now turn to the implications of the two information diagrams, Figs. 3a and 3b, being identical. There are measures [20,22,44,47-53] and expansions [54-56] purporting to measure or otherwise extract the complexity, magnitude, or structure of dependencies within a multivariate distribution. Many of these techniques, including those just cited, are sums and differences of atoms in these information diagrams. As such, they are unable to differentiate these distributions.
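To illustrate why sums and differences of atoms cannot help, the following sketch computes the seven I-diagram atoms of each distribution by inclusion-exclusion over subset entropies. As before, the dyadic pairing and the symbol encoding are assumptions made for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def build(rule):
    # Assumed encoding: symbol = 2 * (first bit) + (second bit).
    return [(2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
            for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6)
            if rule(x0, x1, y0, y1, z0, z1)]

dyadic = build(lambda x0, x1, y0, y1, z0, z1: x0 == y1 and y0 == z1 and z0 == x1)
triadic = build(lambda x0, x1, y0, y1, z0, z1:
                (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1)

def H(outcomes, *idx):
    """Entropy in bits of the marginal over positions `idx` (uniform outcomes)."""
    counts = Counter(tuple(o[i] for i in idx) for o in outcomes)
    n = len(outcomes)
    return -sum(c / n * log2(c / n) for c in counts.values())

def atoms(d):
    """The seven atoms of a three-variable I-diagram (positions 0, 1, 2 = X, Y, Z)."""
    hxyz = H(d, 0, 1, 2)
    return {
        "H[X|YZ]": hxyz - H(d, 1, 2),
        "H[Y|XZ]": hxyz - H(d, 0, 2),
        "H[Z|XY]": hxyz - H(d, 0, 1),
        "I[X:Y|Z]": H(d, 0, 2) + H(d, 1, 2) - hxyz - H(d, 2),
        "I[X:Z|Y]": H(d, 0, 1) + H(d, 1, 2) - hxyz - H(d, 1),
        "I[Y:Z|X]": H(d, 0, 1) + H(d, 0, 2) - hxyz - H(d, 0),
        # Co-information by inclusion-exclusion over all subsets.
        "I[X:Y:Z]": (H(d, 0) + H(d, 1) + H(d, 2)
                     - H(d, 0, 1) - H(d, 0, 2) - H(d, 1, 2) + hxyz),
    }

atoms_dyadic = atoms(dyadic)
atoms_triadic = atoms(triadic)
for name in atoms_dyadic:
    print(f"{name}: dyadic={atoms_dyadic[name]:.1f} triadic={atoms_triadic[name]:.1f}")
```

Any measure built as a linear combination of these seven numbers necessarily returns the same value on both distributions.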
To drive home the point that the concerns raised here are very broad, Table III enumerates the results of applying a great many information measures. It is organized into four sections: entropies, mutual informations, common informations, and other measures.
None of the entropies, dependent only on the probability mass function of the distribution, can distinguish the two distributions. Nor can any of the mutual informations, as they are functions of the information atoms in the I-diagrams of Fig. 3.
The common informations, defined via auxiliary variables satisfying particular properties, can potentially isolate differences in the dependencies. However, only one of them discerns that the two distributions are not equivalent: the Gács-Körner common information K[•] [57,58], which involves the construction of the largest "subrandom variable" common to the variables and is highlighted in the table. It succeeds because the triadic distribution contains the subrandom variable $X_1 = Y_1 = Z_1$ common to all three variables.
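One way to see the Gács-Körner result is to compute the largest common subrandom variable directly. The sketch below uses a connected-components construction: each (variable, value) pair is a node, every outcome links its three nodes, and the component label is then a function of each variable alone. This is an illustrative route to the quantity, not necessarily how the tabulated values were obtained. The dyadic pairing and symbol encoding are assumptions.

```python
from collections import Counter
from itertools import product
from math import log2

def build(rule):
    # Assumed encoding: symbol = 2 * (first bit) + (second bit).
    return [(2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
            for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6)
            if rule(x0, x1, y0, y1, z0, z1)]

dyadic = build(lambda x0, x1, y0, y1, z0, z1: x0 == y1 and y0 == z1 and z0 == x1)
triadic = build(lambda x0, x1, y0, y1, z0, z1:
                (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1)

def gk_common_information(outcomes):
    """Entropy (bits) of the component label shared by all variables."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Link the (variable, value) nodes that co-occur in each outcome.
    for outcome in outcomes:
        nodes = list(enumerate(outcome))
        for node in nodes[1:]:
            parent[find(nodes[0])] = find(node)

    # The component of an outcome can be read off from any single variable.
    labels = Counter(find((0, o[0])) for o in outcomes)
    n = len(outcomes)  # outcomes are equally likely
    return -sum(c / n * log2(c / n) for c in labels.values()) + 0.0

k_dyadic = gk_common_information(dyadic)
k_triadic = gk_common_information(triadic)
print("K[X:Y:Z] dyadic :", k_dyadic)
print("K[X:Y:Z] triadic:", k_triadic)
```

In the triadic case the two components correspond to the two values of the shared bit, which is why the measure registers one bit there and none in the dyadic case.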
Finally, only two of the other measures (also highlighted) identify any difference between the two. Some fail because they are functions of the probability mass function. Others, like the TSE complexity [59] and erasure entropy [60], fail since they are functions of the I-diagram atoms. Only the intrinsic mutual information I[• ↓ •] [26] and the reduced intrinsic mutual information I[• ⇓ •] [27] distinguish the two since the dyadic distribution contains three dyadic subvariables each of which is independent of the third variable, whereas in the triadic distribution the conditional dependence of the xor relation can be destroyed.
Figure 4 demonstrates three different information expansions applied to our distributions of interest; roughly speaking, each groups variables into subsets of different sizes or "scales". The first is the complexity profile [55]. At scale k, the complexity profile is the sum of all I-diagram atoms consisting of at least k variables conditioned on the others. Here, since the I-diagrams are identical, so are the complexity profiles. The second profile is the marginal utility of information [56], which is a derivative of a linear programming problem whose constraints are given by the I-diagram, so here, again, they are identical. Finally, we have Schneidman et al.'s connected informations [70], which are the differences in entropies of the maximum entropy distributions whose k- and (k − 1)-way marginals are fixed to match those of the distribution of interest. Here, all dependencies are detected once pairwise marginals are fixed in the dyadic distribution, but it takes the full joint distribution to realize the xor subdependency in the triadic distribution.
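The connected informations can be sketched numerically. The code below obtains each maximum entropy distribution by iterative proportional fitting (IPF) started from the uniform distribution. IPF is a technique swapped in here for illustration; the text does not say how the maximum entropy distributions were computed. The dyadic pairing, the symbol encoding, and all function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations, product
from math import log2

def build(rule):
    # Assumed encoding: symbol = 2 * (first bit) + (second bit).
    outcomes = [(2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
                for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6)
                if rule(x0, x1, y0, y1, z0, z1)]
    return {o: 1 / len(outcomes) for o in outcomes}

dyadic = build(lambda x0, x1, y0, y1, z0, z1: x0 == y1 and y0 == z1 and z0 == x1)
triadic = build(lambda x0, x1, y0, y1, z0, z1:
                (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1)

def marginal(p, idx):
    m = defaultdict(float)
    for outcome, prob in p.items():
        m[tuple(outcome[i] for i in idx)] += prob
    return m

def entropy(p):
    return -sum(prob * log2(prob) for prob in p.values() if prob > 0) + 0.0

def maxent(p, k, sweeps=20):
    """IPF from uniform: fit every k-way marginal of p over the alphabet {0,1,2,3}."""
    q = {o: 1 / 64 for o in product(range(4), repeat=3)}
    for _ in range(sweeps):
        for idx in combinations(range(3), k):
            target, current = marginal(p, idx), marginal(q, idx)
            for outcome in q:
                key = tuple(outcome[i] for i in idx)
                if current[key] > 0:
                    q[outcome] *= target[key] / current[key]
    return q

def connected_informations(p):
    h = {k: entropy(maxent(p, k)) for k in (1, 2, 3)}
    # Scale k: entropy removed by additionally fixing the k-way marginals.
    return {k: h[k - 1] - h[k] for k in (2, 3)}

c_dyadic = connected_informations(dyadic)
c_triadic = connected_informations(triadic)
print("dyadic :", c_dyadic)
print("triadic:", c_triadic)
```

Under these assumptions the dyadic distribution places all of its dependence at the pairwise scale, while the triadic distribution leaves one bit that only appears once the full joint distribution is fixed.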
Neither the transfer entropy [20], the transinformation [53], the directed information [52], the causation entropy [22], nor any of their generalizations based on conditional mutual information is capable of determining that the dependency structures, and therefore the causal structures, of the two distributions are qualitatively different. This defect fundamentally precludes them from determining causal structure within a system of unknown dependencies. It underlies our prior criticism of these functions as measures of information flow [28].³
A promising approach to understanding informational dependencies is the partial information decomposition (PID) [29]. This framework seeks to decompose a mutual ...
Another somewhat similar approach is that of integrated information theory [76]. However, this approach requires a known dynamic over the variables and is, in addition, highly sensitive to the dynamic. Here, in contrast, we considered only simple probability distributions without any assumptions as to how they might arise from the dynamics of interacting agents. That said, one might associate an integrated information measure with a distribution via the maximal information integration over all possible dynamics that give rise to the distribution. We leave this task for a later study.
³ As discussed there, the failure of these measures stems from the possibility of conditional dependence, whereas the aim for these directed measures is to quantify the information flow from one time series to another excluding influences of the second. In this light, $T_{X \to Y} = I[X_0^t : Y_t \downarrow Y_0^t]$ [26] is certainly an incremental improvement over the transfer entropy.
⁴ Here, we quantified the partial information lattice using the now commonly accepted technique of Ref. [71], though calculations using two other techniques [72,73] match. The original PID measure $I_{\min}$, however, assigns both distributions 1 bit of redundant information and 1 bit of synergistic information. We have yet to compute two other recent proposals [74,75], though we suspect they will match Ref. [71]'s values.
FIG. 4. Suite of information expansions applied to the dyadic and triadic distributions: the complexity profile [55], the marginal utility of information [56], and the connected information [70]. The complexity profile and marginal utility of information profiles are identical for the two distributions as a consequence of the information diagrams (Fig. 3) being identical. The connected informations, quantifying the amount of dependence realized by fixing k-way marginals, are able to distinguish the two distributions. Note that although each x-axis is a scale, exactly what that means depends on the measure.
FIG. 5. Partial information decomposition diagrams for the (a) dyadic and (b) triadic distributions. Here, X and Y are treated as inputs and Z as output, but in both cases the decomposition is invariant to permutations of the variables. In the dyadic case, the relationship is realized as 1 bit of unique information from X to Z and 1 bit of unique information from Y to Z. In the triadic case, the relationship is quantified as X and Y providing 1 bit of redundant information about Z while also supplying 1 bit of information synergistically about Z.
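The remark that the original redundancy measure I_min cannot separate the two distributions can be reproduced in a few lines. The sketch below implements I_min as the expected minimum specific information, following the usual Williams-Beer definition, with X and Y as inputs and Z as output. The dyadic pairing, the symbol encoding, and the function names are assumptions, and this is not the technique of Ref. [71].

```python
from collections import Counter
from itertools import product
from math import log2

def build(rule):
    # Assumed encoding: symbol = 2 * (first bit) + (second bit).
    return [(2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
            for x0, x1, y0, y1, z0, z1 in product((0, 1), repeat=6)
            if rule(x0, x1, y0, y1, z0, z1)]

dyadic = build(lambda x0, x1, y0, y1, z0, z1: x0 == y1 and y0 == z1 and z0 == x1)
triadic = build(lambda x0, x1, y0, y1, z0, z1:
                (x0 + y0 + z0) % 2 == 0 and x1 == y1 == z1)

def H(outcomes, *idx):
    counts = Counter(tuple(o[i] for i in idx) for o in outcomes)
    n = len(outcomes)  # outcomes are equally likely
    return -sum(c / n * log2(c / n) for c in counts.values())

def i_min(outcomes, sources=(0, 1), target=2):
    """Redundancy: sum over z of p(z) * min over sources of I(Z=z; source)."""
    n = len(outcomes)
    pz = Counter(o[target] for o in outcomes)
    redundancy = 0.0
    for z, cz in pz.items():
        specific = []
        for s in sources:
            ps = Counter(o[s] for o in outcomes)
            joint = Counter(o[s] for o in outcomes if o[target] == z)
            # I(Z=z; S) = sum_a p(a|z) * log2( p(z|a) / p(z) )
            specific.append(sum((c / cz) * log2((c / ps[a]) / (cz / n))
                                for a, c in joint.items()))
        redundancy += (cz / n) * min(specific)
    return redundancy

def pid_i_min(outcomes):
    i_x = H(outcomes, 0) + H(outcomes, 2) - H(outcomes, 0, 2)
    i_y = H(outcomes, 1) + H(outcomes, 2) - H(outcomes, 1, 2)
    i_xy = H(outcomes, 0, 1) + H(outcomes, 2) - H(outcomes, 0, 1, 2)
    red = i_min(outcomes)
    return {"redundant": red, "unique_X": i_x - red, "unique_Y": i_y - red,
            "synergistic": i_xy - i_x - i_y + red}

pid_dyadic = pid_i_min(dyadic)
pid_triadic = pid_i_min(triadic)
print("dyadic :", pid_dyadic)
print("triadic:", pid_triadic)
```

Both decompositions come out as one redundant and one synergistic bit, which is the behavior attributed to I_min above and the reason a different redundancy measure is needed to tell the distributions apart.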

DISCUSSION
The broad failure of Shannon information measures to differentiate the dyadic and polyadic distributions has far-reaching consequences. Consider, for example, an experiment where a practitioner places three probes into a cluster of neurons, each probe touching two neurons and reporting 0 when they are both quiescent, 1 when the first is excited but the second quiescent, 2 when the second is excited but the first quiescent, and 3 when both are excited. Shannon-like measures, including the transfer entropy, would be unable to differentiate between the dyadic situation consisting of three pairs of synchronized neurons, the triadic situation consisting of a trio of synchronized neurons, and a trio exhibiting the xor relation, a relation requiring nontrivial sensory integration. Such a situation might arise when probing the circuitry of the Drosophila melanogaster connectome [77], for instance. Furthermore, while partitioning each variable into subvariables made the dependency structure clear, we do not believe that such a refinement should be a necessary step in discovering such structure. Indeed, we demonstrated that refinement is not strictly needed, since the partial information decomposition was able to discover the distribution's internal structure without it.
These results, observations, and the broad survey clearly highlight the need to extend Shannon's theory. In particular, the extension must introduce a fundamentally new measure, not merely sums and differences of the standard Shannon information measures. While the partial information decomposition was initially proposed to overcome the interpretational difficulty of the (potentially negative valued) co-information, we see here that it actually overcomes a vastly more fundamental weakness with Shannon information measures. While negative information atoms can subjectively be seen as a flaw, the inability to distinguish dyadic from polyadic relations is a much deeper and objective issue.
This may lead one to consider the partial information decomposition as the needed extension to Shannon theory. We do not. The partial information decomposition depends on interpreting some random variables as "inputs" and others as "outputs". While this may be perfectly natural in some contexts, it is not satisfactory in general. Consider: how should the triadic distribution's information be allotted? Certainly one of its three bits is redundant and one is synergistic, but what about the third? The xor (dyadic) distribution contains two bits; are both to be considered synergy [78]? In any case, the partial information framework does not address this question.

DYADIC CAMOUFLAGE & DEPENDENCY DIFFUSION
The dyadic and triadic distributions we analyzed thus far were deliberately chosen to have small dimensionality in an effort to make them and the failure of Shannon information measures as comprehensible and intuitive as possible. Since a given data set may have exponentially many different three-variable subsets, even just this pair of trivariate distributions will stymie most any assessment of variable dependency. However, this is simply a starting point. We will now demonstrate that there exist distributions of arbitrary size which mask their k-way dependencies in such a way that they match, from a Shannon information theory perspective, a distribution of the same size containing only dyadic relationships. Furthermore, we show how any such distribution may be obfuscated over any larger set of variables. This likely mandates a search over all partitions of all subsets of a system, making the problem of finding such distributions in the EXPTIME computational complexity class [79].
Specifically, consider the 4-variable parity distribution consisting of four binary variables such that each variable's value is equal to the parity of the remaining three. This is a straightforward generalization of the xor distribution used in constructing the triadic distribution. We next need a generalization of the "giant bit" [63], which we call dyadic camouflage, to mix with the parity, informationally "canceling out" the higher-order mutual informations even though dependencies of such orders exist in the distribution. An example dyadic camouflage distribution for four variables is given in Fig. 6.
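As a small illustration of the 4-variable parity distribution, the sketch below enumerates its outcomes, assuming they are equally likely, and checks two properties: each variable equals the parity of the other three, and every proper subset of the variables looks completely independent. Variable names and helper functions are illustrative.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

# All assignments of four bits with even total parity, taken as equally likely.
parity4 = [bits for bits in product((0, 1), repeat=4) if sum(bits) % 2 == 0]

def entropy(outcomes, idx):
    """Entropy in bits of the marginal over positions `idx` (uniform outcomes)."""
    counts = Counter(tuple(o[i] for i in idx) for o in outcomes)
    n = len(outcomes)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Each variable equals the parity of the remaining three.
each_is_parity = all(
    bits[i] == sum(bits[j] for j in range(4) if j != i) % 2
    for bits in parity4 for i in range(4)
)

# Entropy of every subset, keyed by subset size.
entropies_by_size = {
    r: [entropy(parity4, idx) for idx in combinations(range(4), r)]
    for r in (1, 2, 3, 4)
}

print("outcomes:", len(parity4))
print("each variable is the parity of the rest:", each_is_parity)
for r, values in entropies_by_size.items():
    print(f"{r}-variable entropies:", values)
```

Every subset of up to three variables carries as many bits as it has variables, so the dependence is invisible below the four-way scale. This is the kind of structure the camouflage construction is meant to hide further.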
Generically, an n-variable dyadic camouflage distribution has an alphabet size of $2^{n-2}$ and consists of $2^{(n-2)(n-1)/2}$ equally likely outcomes, both numbers determined by entropy considerations. The distribution is constrained such that any two variables are completely determined by the remaining n − 2. Moreover, each m-variable subdistribution has equal entropy and, otherwise, is of maximal joint entropy. One method of constructing such a distribution is to begin by writing down one variable in increasing lexicographic order such that it has the correct number of outcomes; e.g., column W in Fig. 6a. For each remaining variable, uniformly apply permutations of increasing size to its set of values.
Finally, one can obfuscate any distribution by embedding it in a larger collection of random variables. Given a distribution D over n variables, associate each random variable i of D with a k-variable subset of a distribution D′ in such a way that there is a mapping from the k outcomes in the subset of D′ to the outcome of the variable i in D. For example, one can embed the xor distribution over X, Y, Z into six variables $X_0, X_1, Y_0, Y_1, Z_0, Z_1$ via $Z_0 \oplus Z_1 = (X_0 \oplus X_1) \oplus (Y_0 \oplus Y_1)$. In other words, the parity of $(Z_0, Z_1)$ is equal to the xor of the parities of $(X_0, X_1)$ and $(Y_0, Y_1)$. In this way one must potentially search over all partitions of all subsets of D′ in order to discover the distribution D hiding within. We refer to this method of obfuscation as dependency diffusion.
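The six-variable embedding can be sketched directly from the parity description. The code below assumes the embedded distribution is uniform over all bit assignments in which the parity of (Z_0, Z_1) equals the xor of the parities of (X_0, X_1) and (Y_0, Y_1). That uniformity is an assumption made for illustration, as are the names used.

```python
from collections import Counter
from itertools import combinations, product
from math import log2

# Bit order: (X0, X1, Y0, Y1, Z0, Z1); assumed uniform over consistent assignments.
embedded = [b for b in product((0, 1), repeat=6)
            if (b[4] ^ b[5]) == ((b[0] ^ b[1]) ^ (b[2] ^ b[3]))]

def entropy(outcomes, idx):
    """Entropy in bits of the marginal over positions `idx` (uniform outcomes)."""
    counts = Counter(tuple(o[i] for i in idx) for o in outcomes)
    n = len(outcomes)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Collapsing each pair to its parity should recover the xor distribution.
collapsed = Counter((b[0] ^ b[1], b[2] ^ b[3], b[4] ^ b[5]) for b in embedded)

# Any five of the six bits look like independent fair coins.
five_subset_entropies = [entropy(embedded, idx)
                         for idx in combinations(range(6), 5)]

print("embedded outcomes:", len(embedded))
print("collapsed parities:", dict(collapsed))
print("five-variable entropies:", five_subset_entropies)
print("joint entropy:", entropy(embedded, range(6)))
```

The xor dependency is recovered only after grouping the six bits into the right three pairs, which hints at why discovering a diffused dependency may require searching over partitions of subsets.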
The first conclusion is that the challenges of conditional dependence can be found in joint distributions over arbitrarily large sets of random variables. The second conclusion, one that heightens the challenge to discovery, is that even finding which variables are implicated in polyadic dependencies can be exponentially difficult. Together the camouflage and diffusion constructions demonstrate how challenging it is to discover, let alone work with, multivariate dependencies. This difficulty strongly implies that the current state of information-theoretic tools is vastly underpowered for the types of analyses required of our modern, data-rich sciences.
It is unlikely that the parity plus dyadic camouflage distribution discussed here is the only example of Shannon measures conflating the arity of dependencies and thus producing an information diagram identical to that of a qualitatively distinct distribution. This suggests an important challenge: find additional, perhaps simpler, joint distributions exhibiting this phenomenon.

CONCLUSION
To conclude, we constructed two distributions that cannot be distinguished using conventional (and many nonconventional) Shannon-like information measures. In fact, of the more than two dozen measures we surveyed only five were able to separate the distributions: the Gács-Körner common information, the intrinsic mutual information, the reduced intrinsic mutual information, the connected informations, and the partial information decomposition. We also noted in an aside that causality detection approaches that assume an underlying directed acyclic graph structure are structurally impotent.
The failure of the Shannon-type measures is perhaps not surprising: nothing in the standard mathematical theories of information and communication suggests that such measures should be able to distinguish these distributions [80]. However, distinguishing dyadic from triadic relationships and the related causal structure is of the utmost importance to the sciences. Critically, since interpreting dependencies in random distributions is traditionally the domain of information theory, we propose that new extensions to information theory are needed.
Furthermore, the dyadic camouflage distribution (Fig. 6) presents an acute challenge for traditional methods of dependency and causality inference. Let's close with an example. Consider the widely used Granger causality [81] applied to the camouflage distribution. Fixing any two variables, say X and Y , determines the remaining two, Z and W . What is one to conclude from this, other than that X cannot influence Y ? And yet, in conjunction with either Z or W , X completely determines Y . This makes clear the deep assumption of dyadic relationships that permeates and biases our ways of thinking about complex systems.
These results may seem like a deal-breaking criticism of employing information theory to determine dependencies. Indeed, these results seem to indicate that much existing empirical work and many interpretations have simply been wrong and, worse even, that the associated methods are misleading while appearing quantitatively consistent. We think not, though. With the constructive and detailed problem diagnosis given here, at least we can see the true problem. It is now a necessary step to address it. This leads us to close with a cautionary quote:
"The tools we use have a profound (and devious!) influence on our thinking habits, and, therefore, on our thinking abilities."
Edsger W. Dijkstra [82]

ACKNOWLEDGMENTS
We thank N. Barnett and D. P. Varn for helpful feedback. JPC thanks the Santa Fe Institute for its hospitality during visits as an External Faculty member. This material is based upon work supported by, or in part by, the U. S. Army Research Laboratory and the U. S. Army Research Office under contracts W911NF-13-1-0390 and W911NF-13-1-0340.

A Python Discrete Information Package
Hand calculating the information quantities used in the main text, while profitably done for a few basic examples, soon becomes tedious and error prone. We provide a Jupyter notebook [83] making use of dit ("Discrete Information Theory") [84], an open source Python package that readily calculates these quantities.