Higher-Order Interactions and Their Duals Reveal Synergy and Logical Dependence beyond Shannon-Information

Information-theoretic quantities reveal dependencies among variables in the structure of joint, marginal, and conditional entropies while leaving certain fundamentally different systems indistinguishable. Furthermore, there is no consensus on the correct higher-order generalisation of mutual information (MI). In this manuscript, we show that a recently proposed model-free definition of higher-order interactions among binary variables (MFIs) is, like mutual information, a Möbius inversion on a Boolean algebra, but of surprisal instead of entropy. This provides an information-theoretic interpretation to the MFIs, and by extension to Ising interactions. We study the objects dual to mutual information and the MFIs on the order-reversed lattices. We find that dual MI is related to the previously studied differential mutual information, while dual interactions are interactions with respect to a different background state. Unlike (dual) mutual information, interactions and their duals uniquely identify all six 2-input logic gates, the dyadic and triadic distributions, and different causal dynamics that are identical in terms of their Shannon information content.


Higher-Order Interactions
All non-trivial structures in data or probability distributions correspond to dependencies among the different features, or variables. These dependencies can be present among pairs of variables, i.e., pairwise, or can be higher-order. A dependency, or interaction, is called higher-order if it is inherently a property of more than two variables and if it cannot be decomposed into pairwise quantities. The term has been used more generally to refer simply to complex interactions, as for example in [1] to refer to changes in gene co-expression over time; in this article, however, it is used only in the stricter sense defined in Section 2.
The reason such higher-order structures are interesting is twofold. First, higher-order dependence corresponds to a fundamentally different kind of communication and interaction among the components of a system. If a system contains higher-order interactions, then its dependency structure cannot be represented by a graph and requires a hypergraph, where a single 'hyperedge' can connect more than two nodes. It is desirable to be able to detect and describe such systems accurately, which requires a good understanding of higher-order interactions. Second, higher-order interactions might play an important role in nature, and have been identified in various interaction networks, including genetic [2][3][4][5], neuronal [6][7][8][9][10], ecological [11][12][13], drug interaction [14], social [15][16][17], and physical [18,19] networks. Furthermore, there is evidence that higher-order interactions are responsible for the rich dynamics [20] or bistability [21] in biological networks; for example, synthetic lethality experiments have shown that the trigenic interactions in yeast form a larger network than the pairwise interactions [4].

Model-Free Interactions and the Inverse Ising Problem
In 1957, E.T. Jaynes famously showed that statistical equilibrium mechanics can be seen as a maximum entropy solution to the inverse problem of constructing a probability distribution that best reproduces a sample distribution [30]. More precisely, the equilibrium dynamics of the (inhomogeneous, or glass-like) generalised Ising model with interactions up to the nth order arise naturally as the maximum entropy distribution compatible with a dataset after observing the first n moments among binary variables. This means that in order to reproduce the moments in the data in a maximally non-committal way, it is necessary to introduce higher-order interactions, i.e., terms that involve more than two variables, in the description of the system. Fitting such a generalised Ising model to data is nontrivial; while the log-likelihood of the Ising model is concave in the coupling parameters, the cost of evaluating it is exponential in the total number of variables N, which is often intractable in practice [31]. In [32], the authors introduced an estimator of model-free interactions (MFIs) that exactly coincides with the solution to the inverse generalised Ising problem. Moreover, the cost of estimating all nth-order model-free interactions among N variables from M observations scales as O(M · (N choose n)) = O(M N^n), i.e., polynomially in the total system size N. However, this is true only when sufficient data is available. With limited data, certain interactions might require inferring the conditional dependencies from the data, which in the worst case scales exponentially in N again. The definition of MFIs offered in [32] seems to be a general one; in addition to offering a solution to the inverse generalised Ising problem, MFIs are expressible in terms of average treatment effects (ATEs) or regression coefficients. Throughout this article, the general term 'MFI' is used, and may be read simply as referring to the maximum entropy or Ising interaction.

Outline
In Section 2.1, the definition of the MFIs is stated along with a number of their properties. To explicitly link the MFIs to information theory, a redefinition of mutual information in terms of Möbius inversions is provided in Section 2.2, which is then linked to a similar redefinition of the MFIs in Sections 3.1 and 3.2. A definition in terms of Möbius inversions naturally leads to dual definitions of all objects, which are subsequently explored in Section 3.3. Then, in Section 4, simple fundamental examples are used to demonstrate that MFIs can differentiate distributions that entropy-based quantities cannot. Finally, the results are summarised and reflected upon in Section 5.

Model-Free Interactions
We start by re-defining the interactions introduced in [32]. We define the isolated effect (or 1-point interaction) I^(Y)_i of a variable X_i ∈ X on an observable Y as

I^(Y)_i = ∂Y(X)/∂X_i |_{X̄=0},    X̄ = X \ {X_i},

where the effect of X_i on Y is isolated by conditioning on all other variables being zero. This expression is well-defined, as the restriction of a derivative is the derivative of the restriction. A pair of variables X_i and X_j has a 2-point interaction I_ij when the value of X_j changes the 1-point interaction of X_i on Y:

I^(Y)_ij = ∂²Y(X)/(∂X_i ∂X_j) |_{X̄=0},    X̄ = X \ {X_i, X_j}.

A third variable X_k can modulate this 2-point interaction through what we call a 3-point interaction, defined analogously as a third derivative. This process of taking derivatives with respect to an increasing number of variables can be repeated to define n-point interactions.
Definition 1 (n-point interaction with respect to outcome Y). Let p be a probability distribution over a set X of random variables X_i and let Y be a function Y : X → R. Then, the n-point interaction I_{X_1...X_n} between variables {X_1, . . . , X_n} ⊆ X is provided by

I^(Y)_{X_1...X_n} = ∂^n Y(X) / (∂X_1 . . . ∂X_n) |_{X̄=0}    (4)

where X̄ = X \ {X_1, . . . , X_n}.
This definition of interaction makes explicit the fact that interactions are defined with respect to some outcome. The authors of [32] refer to the interactions from Definition 1 as additive, which they distinguish from multiplicative interactions. However, when the outcome is chosen to be the log of the joint distribution p(X) over all variables X, the additive and multiplicative interactions are equivalent and simply related through a logarithm [32]. Setting the outcome to be log p(X) has other nice properties as well. First, while probabilities are restricted to the non-negative reals, a log-transformation removes this restriction and makes the outcome and subsequent interactions take both positive and negative values, which can have different interpretations. Second, it is this outcome that makes the interactions interpretable as maximum entropy interactions, as they exactly coincide with Ising interactions. Finally, this can be considered the most general outcome possible, as all marginal and conditional probabilities are encoded in this joint distribution. This leads to the following definition of a model-free interaction.

Definition 2 (model-free n-point interaction between binary variables). A model-free n-point interaction (MFI) is an n-point interaction between binary random variables with respect to the logarithm of their joint probability:

I_{X_1...X_n} := I^{(log p(X))}_{X_1...X_n} = ∂^n log p(X) / (∂X_1 . . . ∂X_n) |_{X̄=0}

where X̄ = X \ {X_1, . . . , X_n}.
If the variables X_i ∈ X are binary, then a definition for a derivative with respect to a binary variable is needed.

Definition 3 (derivative of a function with respect to a binary variable). Let f : B^n → R be a real-valued function of a set X of n binary variables, labelled as X_i, 1 ≤ i ≤ n. Then, the derivative operator with respect to X_i acts on f(X) as follows:

∂f(X)/∂X_i := f(X_i = 1, X \ {X_i}) − f(X_i = 0, X \ {X_i}).

The linearity of the derivative operator then immediately and uniquely defines the higher-order derivatives.
Using this definition, the n-point interactions become model-free in the sense that they are ratios of probabilities that do not involve the functional form of the joint probability distribution. For example, writing X_ijk = (a, b, c) for (X_i = a, X_j = b, X_k = c) and X̄ for the remaining variables, the first three orders can be written out as follows (recall that the notation ∂/∂X_i here refers to the derivative operator from Definition 3):

I_i = log [ p(X_i = 1 | X̄ = 0) / p(X_i = 0 | X̄ = 0) ]

I_ij = log [ p(X_ij = (1,1) | X̄ = 0) p(X_ij = (0,0) | X̄ = 0) / ( p(X_ij = (1,0) | X̄ = 0) p(X_ij = (0,1) | X̄ = 0) ) ]

I_ijk = log [ p(X_ijk = (1,1,1) | X̄ = 0) p(X_ijk = (1,0,0) | X̄ = 0) p(X_ijk = (0,1,0) | X̄ = 0) p(X_ijk = (0,0,1) | X̄ = 0) / ( p(X_ijk = (1,1,0) | X̄ = 0) p(X_ijk = (1,0,1) | X̄ = 0) p(X_ijk = (0,1,1) | X̄ = 0) p(X_ijk = (0,0,0) | X̄ = 0) ) ]

where Bayes' rule is used to replace joint probabilities with conditional probabilities. This definition of interaction has the following properties:
• It is symmetric in the variables, as I_S = I_π(S) for any set of variables S and any permutation π.
• Conditionally independent variables do not interact.
• The definition coincides with that of a log-odds ratio, which has already been considered as a measure of interaction in, e.g., [33,34].
• If X̄ does not include the full complement of the interacting variables, the bias this induces in the estimate of the interaction is proportional to the pointwise mutual information of states where the omitted variables are 0.
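As an illustration, the derivative operator of Definition 3 and the log-ratio forms above can be implemented in a few lines. This is a minimal sketch rather than the estimator of [32]; the distribution, a Boltzmann distribution with a single hypothetical pairwise coupling J, is chosen so that the answer is known in advance:

```python
import math
from itertools import product

def boolean_derivative(f, i):
    """Definition 3: (df/dX_i)(x) = f(..., X_i = 1, ...) - f(..., X_i = 0, ...)."""
    def df(x):
        return f(x[:i] + (1,) + x[i + 1:]) - f(x[:i] + (0,) + x[i + 1:])
    return df

def mfi(p, variables, n):
    """n-point MFI: iterated Boolean derivatives of log p, all other variables at 0."""
    f = lambda x: math.log(p[x])
    for i in variables:
        f = boolean_derivative(f, i)
    return f((0,) * n)

# hypothetical Boltzmann distribution with one pairwise coupling J between X0 and X1
J, n = 0.7, 3
Z = sum(math.exp(J * x[0] * x[1]) for x in product((0, 1), repeat=n))
p = {x: math.exp(J * x[0] * x[1]) / Z for x in product((0, 1), repeat=n)}

print(round(mfi(p, [0, 1], n), 6))    # the 2-point MFI recovers the coupling J = 0.7
print(round(mfi(p, [0, 1, 2], n), 6))  # the 3-point MFI vanishes
```

The 2-point MFI recovers the coupling exactly, and every interaction not put into the distribution vanishes, as expected from the equivalence with Ising interactions.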

Mutual Information as a Möbius Inversion
The definition of an n-point interaction as a derivative of a derivative is reminiscent of Gregory Bateson's view of information as a difference which makes a difference [35]; however, the relationship between information theory and model-free interactions rests on more than a linguistic coincidence. It turns out that interactions and information are generalised derivatives of similar functions on Boolean algebras. To see this, consider the definition of pairwise mutual information and its third-order generalisation:

MI(X; Y) = H(X) + H(Y) − H(X, Y)    (11)

MI(X; Y; Z) = H(X) + H(Y) + H(Z) − H(X, Y) − H(X, Z) − H(Y, Z) + H(X, Y, Z)    (13)

Note that all MI-based quantities can be written thusly as sums of marginal entropies of subsets of the set of variables. Given a finite set of variables S, its powerset P(S) can be assigned a partial ordering by inclusion: A ≤ B ⇐⇒ A ⊆ B. This poset P = (P(S), ⊆) is called a Boolean algebra, and because each pair of sets has a unique supremum (their union) and infimum (their intersection), it is a lattice. This lattice structure is visualised for two and three variables in Figure 1. In general, the lattice of an n-variable Boolean algebra forms an n-cube. Furthermore, for any finite n, the n-variable Boolean algebra forms a bounded lattice, which means that it has a greatest element, denoted as 1̂, and a least element, denoted as 0̂. On a poset P, we define the Möbius function µ_P : P × P → R as

µ_P(x, y) = 1 if x = y,    µ_P(x, y) = −∑_{x≤z<y} µ_P(x, z) if x < y,    µ_P(x, y) = 0 otherwise.

This function type makes µ_P an element of the incidence algebra of P. In fact, µ is the inverse of the zeta function ζ : ζ(x, y) = 1 iff x ≤ y, and 0 otherwise. On a Boolean algebra, such as a powerset ordered by inclusion, the Möbius function takes the simple form µ(x, y) = (−1)^{|x|−|y|} [36,37]. This definition allows the mutual information among a set of variables τ to be written as follows [38,39]:

MI(τ) = −∑_{η≤τ} µ(0̂, η) H(η) = −∑_{η⊆τ} (−1)^{|η|} H(η)    (15)

where P is the Boolean algebra with τ = 1̂ and H(η) is the marginal entropy of the set of variables η. Indeed, this coincides with Equation (11) for τ = {X, Y} and with Equation (13) for τ = {X, Y, Z}. Equation (15) is a convolution known as a Möbius inversion.
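In code, Equation (15) amounts to an alternating sum of marginal entropies over the subset lattice. The following sketch uses a toy distribution chosen so that the results are easy to check; states are tuples of binary values:

```python
import math
from itertools import chain, combinations

def subsets(s):
    s = tuple(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def marginal_entropy(p, sub):
    """Entropy (in bits) of the marginal distribution over the variables in sub."""
    marg = {}
    for x, px in p.items():
        key = tuple(x[i] for i in sub)
        marg[key] = marg.get(key, 0.0) + px
    return -sum(q * math.log2(q) for q in marg.values() if q > 0)

def mutual_information(p, tau):
    """Eq. (15): MI(tau) = -sum_{eta in subsets(tau)} (-1)^|eta| H(eta)."""
    return -sum((-1) ** len(eta) * marginal_entropy(p, eta) for eta in subsets(tau))

# toy example: X and Y are perfectly correlated fair bits, Z is independent
p = {(0, 0, 0): 0.25, (0, 0, 1): 0.25, (1, 1, 0): 0.25, (1, 1, 1): 0.25}
print(mutual_information(p, (0, 1)))     # 1.0 bit
print(mutual_information(p, (0, 2)))     # zero
print(mutual_information(p, (0, 1, 2)))  # zero, as in Eq. (13)
```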
Given a function g on a poset P, the function

f(y) = ∑_{x≤y} µ(x, y) g(x)

is called the Möbius inversion of g on P. Furthermore, this equation can be inverted to yield

g(y) = ∑_{x≤y} f(x)    (18)

The Möbius inversion is a generalisation of the derivative to posets. If P = (N, ≤), Equation (18) is just a discrete version of the fundamental theorem of calculus [36]. Equation (18) additionally implies that we can express joint entropy as a sum over mutual information. For example, in the case of three variables,

H(X, Y, Z) = H(X) + H(Y) + H(Z) − MI(X; Y) − MI(X; Z) − MI(Y; Z) + MI(X; Y; Z)

Instead of starting with entropy, we could start with a quantity known as surprisal, or self-information, defined as the negative log probability of a certain state or realisation:

h(x) = −log p(X = x)

Surprisal plays an important role in information theory; indeed, the expected surprisal across all possible realisations X = x is the entropy of the variable X:

H(X) = E_x[h(x)] = −∑_x p(X = x) log p(X = x)

As we are often interested in the marginal surprisal of a realisation X = x summed over Y, we can write this explicitly as

h(x) = −log ∑_y p(X = x, Y = y)

With this, consider the Möbius inversion of the marginal surprisal over the lattice P:

PMI(x_τ) = −∑_{η⊆τ} (−1)^{|η|} h(x_η)

This is a generalised version of the pointwise mutual information, which is usually defined on just two variables:

PMI(x; y) = log [ p(x, y) / ( p(x) p(y) ) ]

• Mutual information is the Möbius inversion of marginal entropy.
• Pointwise mutual information is the Möbius inversion of marginal surprisal.

MFIs as Möbius Inversions
With mutual information defined in terms of Möbius inversions, the same can be done for the model-free interactions. Again, we start with (negative) surprisal. However, on Boolean variables a state is just a partition of the variables into two sets: one in which the variables are set to 1, and another in which they are set to 0. That means that the surprisal of observing a particular state is completely specified by which variables X ⊆ Z are set to 1 while keeping all other variables Z \ X at 0, which can be written as

S_{X;Z} := log p(X = 1, Z \ X = 0)    (27)

Definition 5 (interactions as Möbius inversions). Let p be a probability distribution over a set T of random variables and let P = (P(τ), ⊆) be the powerset of a set τ ⊆ T ordered by inclusion. Then, the interaction I(τ; T) among the variables τ is provided by

I(τ; T) = ∑_{α≤τ} µ(α, τ) S_{α;T} = ∑_{α⊆τ} (−1)^{|τ|−|α|} log p(α = 1, T \ α = 0)

For example, when τ contains a single variable X, then

I({X}; T) = S_{X;T} − S_{∅;T} = log [ p(X = 1, T \ X = 0) / p(T = 0) ]

which coincides with the 1-point interaction of Definition 2. In fact, this pattern holds in general.
Theorem 1 (equivalence of interactions). The interaction I(τ; T) from Definition 5 is the same as the model-free interaction I_τ from Definition 2; that is, for any set of variables τ ⊆ T it is the case that

I(τ; T) = I_τ

Proof. We have to show that

∑_{α⊆τ} µ(α, τ) S_{α;T} = ∂^{|τ|} log p(X) / (∂X_1 . . . ∂X_{|τ|}) |_{X̄=0}

Both sides of this equation are sums of ±log p(s), where s is some binary string; thus, we have to show that the same strings appear with the same sign.
First, note that the Boolean algebra of sets ordered by inclusion (as in Figure 1) is equivalent to the poset of binary strings where for any two strings a and b, a ≤ b ⇐⇒ a ∧ b = a. The equivalence follows immediately upon setting each element a ∈ P(S) to the string in which a = 1 and S \ a = 0. This map is one-to-one and monotonic with respect to the partial order, as A ⊆ B ⇐⇒ A ∩ B = A. This means that Definition 5 can be rewritten as a Möbius inversion on the lattice of Boolean strings S = (B^{|τ|}, ≤) (shown for the three-variable case on the left side of Figure 2):

I(τ; T) = ∑_{s≤t} µ(s, t) S_{s;T},

where t is the string representation of τ. Note that for any pair (α, τ) where α ⊆ τ, with respective string representations (s, t) ∈ B^{|τ|} × B^{|τ|}, we have the following:

µ_P(α, τ) = µ_S(s, t) = (−1)^{|t|−|s|},

where |s| denotes the number of 1s in s. Thus, we can write

I(τ; T) = ∑_{s≤t} (−1)^{|t|−|s|} log p(s, 0).

To see that this exactly coincides with Definition 2, we can define a map e^{(n)}_{i,c} : F_{B^n} → F_{B^{n−1}}, where F_{B^n} is the set of functions from n Boolean variables to R. This map is defined as

(e^{(n)}_{i,c} f)(X_1, . . . , X_{n−1}) = f(X_1, . . . , X_{i−1}, c, X_i, . . . , X_{n−1}),

i.e., it fixes the ith argument to c ∈ B. With this map, the Boolean derivative of a function f(X_1, . . . , X_n) (see Definition 3) can be written as

∂f/∂X_i = (e^{(n)}_{i,1} − e^{(n)}_{i,0}) f.

In this way, the derivative with respect to a set S of m variables becomes function composition:

∂^m f / (∂X_1 . . . ∂X_m) = (e_{1,1} − e_{1,0}) ∘ . . . ∘ (e_{m,1} − e_{m,0}) f.

From this, it is clear that a term f(s) appears with a minus sign iff e^{(n)}_{i,0} has been applied an odd number of times. Therefore, terms for which s contains an odd number of 0s receive a minus sign. This can be summarised as

∂^n f / (∂X_1 . . . ∂X_n) |_{X=0} = ∑_{s∈B^n} (−1)^{n−|s|} f(s).

Therefore, we can write

I_τ = ∑_{s∈B^n} (−1)^{n−|s|} log p(s, 0).

The sums ∑_{s≤1̂_S} and ∑_{s∈B^n} contain exactly the same terms, meaning that Equations (38) and (45) are equal. This completes the proof.
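Theorem 1 can also be checked numerically: the iterated Boolean derivative of log p and the alternating subset sum of surprisals agree on a random strictly positive distribution. A sketch (helper names are illustrative):

```python
import math
import random
from itertools import combinations, product

random.seed(0)
n = 3
# random strictly positive joint distribution over three binary variables
w = {x: random.random() + 0.1 for x in product((0, 1), repeat=n)}
Z = sum(w.values())
p = {x: v / Z for x, v in w.items()}

def mfi_derivative(tau):
    """Definition 2: iterated Boolean derivatives of log p, evaluated at 0."""
    f = lambda x: math.log(p[x])
    for i in tau:
        f = (lambda g, j: lambda x: g(x[:j] + (1,) + x[j + 1:])
                                  - g(x[:j] + (0,) + x[j + 1:]))(f, i)
    return f((0,) * n)

def mfi_mobius(tau):
    """Definition 5: sum over subsets alpha of tau of (-1)^(|tau|-|alpha|) log p(alpha=1, rest=0)."""
    total = 0.0
    for r in range(len(tau) + 1):
        for alpha in combinations(tau, r):
            x = tuple(1 if i in alpha else 0 for i in range(n))
            total += (-1) ** (len(tau) - r) * math.log(p[x])
    return total

for tau in [(0,), (1,), (0, 1), (0, 2), (0, 1, 2)]:
    assert abs(mfi_derivative(tau) - mfi_mobius(tau)) < 1e-12
print("Definition 2 and Definition 5 agree")
```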
Note that the structure of the lattice reveals structure in the interactions, as previously noted in [32]. On the right-hand side of Figure 2, two faces of the three-variable lattice are shaded. The green region corresponds to the 2-point interaction between the first two variables. The red region contains a similar interaction between the first two variables, except this time in the context of the third variable fixed to 1 instead of 0. This illustrates the interpretation of a 3-point interaction as the difference of two 2-point interactions (I_XYZ = I_XY|Z=1 − I_XY|Z=0; note that I_XY|Z=0 is usually written as just I_XY). The symmetry of the cube reveals the three different (though equivalent) choices as to which variable to set to 1. Treating the three-variable Boolean algebra as a die, the three upward-facing sides correspond to these choices, and we have

I_XYZ = I_XY|Z=1 − I_XY|Z=0 = I_XZ|Y=1 − I_XZ|Y=0 = I_YZ|X=1 − I_YZ|X=0.

As before, we can invert Definition 5 and express the surprisal of observing a state in which the variables in τ are all 1 in terms of interactions, as follows:

S_{τ;T} = ∑_{α⊆τ} I(α; T).

For example, in the case where T = {X, Y, Z} and τ = {X, Y},

log p(1, 1, 0) = I_XY + I_X + I_Y + log p(0, 0, 0),

which illustrates that when X and Y tend to be off (I_X < 0 and I_Y < 0) and X and Y tend to be different (I_XY < 0), observing the state (1, 1, 0) is very surprising.
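The identity I_XYZ = I_XY|Z=1 − I_XY|Z=0 can be verified on a small example. The distribution below, a Boltzmann distribution with unit fields and a single hypothetical 3-body coupling K, is an assumption made for illustration:

```python
import math
from itertools import product

# hypothetical Boltzmann distribution: field 1 on each variable, 3-body coupling K = 2
K = 2.0
E = {s: s[0] + s[1] + s[2] + K * s[0] * s[1] * s[2] for s in product((0, 1), repeat=3)}
Z = sum(math.exp(e) for e in E.values())
p = {s: math.exp(e) / Z for s, e in E.items()}

def I_XY_given_Z(z):
    """2-point interaction of X and Y in the context Z = z (a log-odds ratio)."""
    return math.log(p[(1, 1, z)] * p[(0, 0, z)] / (p[(1, 0, z)] * p[(0, 1, z)]))

I_XYZ = I_XY_given_Z(1) - I_XY_given_Z(0)
print(round(I_XYZ, 6))  # recovers the 3-body coupling K = 2.0
```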

Categorical Interactions
Taking seriously the definition of interactions as the Möbius inversion of surprisal, one might ask what happens when surprisal is inverted over a different lattice instead of a Boolean algebra. One example is shown in Figure 3; it corresponds to variables that can take three values (0, 1, or 2) where states are ordered by a ≤ b ⇐⇒ ∀i : a_i ≤ b_i. To calculate interactions on this lattice, we need to know the values of Möbius functions of the type µ(s, (2,2)). It can be readily verified that most Möbius functions of this type are zero, with the exceptions of µ((2,2), (2,2)) = µ((1,1), (2,2)) = 1 and µ((2,1), (2,2)) = µ((1,2), (2,2)) = −1, which provide the exact terms in the interactions between two categorical variables changing from 1 → 2 (as defined in [32]). Calculating interactions on different sublattices with 1̂ = (2,1), (1,2), or (1,1) provides us with the other categorical interactions. The transitivity property of the interactions, i.e., I(X : 0 → 2, Y : 0 → 1) = I(X : 0 → 1, Y : 0 → 1) + I(X : 1 → 2, Y : 0 → 1), follows immediately from the structure of the lattice in Figure 3 and the alternating signs of the Möbius functions on a Boolean algebra.
Figure 3. The lattice of two variables that can take three values, ordered by a ≤ b ⇐⇒ ∀i : a_i ≤ b_i.
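The quoted values of µ(s, (2,2)) can be recovered directly from the defining recursion of the Möbius function; a short sketch:

```python
from functools import lru_cache
from itertools import product

states = tuple(product(range(3), repeat=2))  # two variables taking values 0, 1, 2
leq = lambda a, b: all(x <= y for x, y in zip(a, b))

@lru_cache(maxsize=None)
def mobius(x, y):
    """Defining recursion: mu(x, x) = 1; mu(x, y) = -sum_{x<=z<y} mu(x, z) if x < y; else 0."""
    if x == y:
        return 1
    if not leq(x, y):
        return 0
    return -sum(mobius(x, z) for z in states if leq(x, z) and leq(z, y) and z != y)

top = (2, 2)
nonzero = {s: mobius(s, top) for s in states if mobius(s, top) != 0}
print(nonzero)  # {(1, 1): 1, (1, 2): -1, (2, 1): -1, (2, 2): 1}
```

Only the four values quoted in the text survive; all other µ(s, (2,2)) vanish.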

Information and Interactions on Dual Lattices
Lattices have the property that the same set with the reversed order remains a lattice; that is, if L = (S, ≤_L) is a lattice, then so is its dual L^op = (S, ≥_L). This raises the question of what corresponds to mutual information and interaction on such dual lattices. Recognising that a poset L = (S, ≤_L) is a category C with objects S and a morphism f : A → B iff B ≤_L A, these become definitions in the opposite category C^op, meaning that they define dual objects.
Let us start with mutual information. We can calculate the dual mutual information, denoted MI*, by first noting that the dual to a Boolean algebra is another Boolean algebra, meaning that we have µ(x, y) = (−1)^{|x|−|y|}. Simply replacing P with P^op in Equation (15) yields

MI*(τ) = −∑_{η⊇τ} µ(τ, η) H(η).

The dual mutual information of τ = 1̂_{P^op} = ∅ is simply MI*(∅) = MI(1̂_P), that is, the mutual information among all variables. However, the dual mutual information of a singleton set {X} is

MI*({X}) = MI(1̂_P \ {X}) − MI(1̂_P) = ∆_X,

where ∆ is known as the differential mutual information and describes the change in mutual information when leaving out X [40], i.e., when marginalising over the variable X. Note that a similar construction was already anticipated in [41] and that the differential mutual information has previously been used to describe information structures in genetics [39].
On the Boolean algebra of three variables {X, Y, Z}, the dual mutual information of X can be written out as follows:

MI*({X}) = −H(X) + H(X, Y) + H(X, Z) − H(X, Y, Z) = MI(Y; Z) − MI(X; Y; Z).

Because ∆ is the dual of mutual information, it should arguably be called the mutual co-information; however, the term co-information is unfortunately already in use to refer to normal higher-order mutual information.
To find the dual to the interactions, we start from Equation (36) and construct S^op = (B^{|τ|}, ≥), the dual to the lattice of binary strings S = (B^{|τ|}, ≤). A dual interaction of variables τ ⊆ T is denoted as I*(τ; T) and is defined as follows:

I*(τ; T) = ∑_{s≥t} µ_{S^op}(s, t) S_{s;T},

where t is the string representation of τ. Again, when τ = 1̂_{S^op} = 0̂_S = ∅, this is simply (−1)^{|τ|} I(1̂_S), while the dual interaction of a singleton set {X} is an alternating sum of surprisals over the supersets of {X}. For example, on the three-variable lattice in Figure 2, the dual interaction of X is

I*(X; T) = S_{XYZ;T} − S_{XY;T} − S_{XZ;T} + S_{X;T} = log [ p(1, 1, 1) p(1, 0, 0) / ( p(1, 1, 0) p(1, 0, 1) ) ].

Writing the states as (X, Y, Z), it can be seen that this is equal to

I*(X; T) = log [ p(Y = 1, Z = 1 | X = 1) p(Y = 0, Z = 0 | X = 1) / ( p(Y = 1, Z = 0 | X = 1) p(Y = 0, Z = 1 | X = 1) ) ],

which is similar to the 2-point interaction I_YZ defined in Equation (8), now conditioned on X = 1 instead of 0. Note the difference between dual mutual information and dual interactions here; the dual mutual information of X describes the effect on the mutual information from marginalising over X, whereas the dual interaction of X describes the effect on an interaction when fixing X = 1. This reflects a fundamental difference between mutual information and the interactions, in that the former is an averaged quantity and the latter a pointwise quantity. Dual interactions should probably be called co-interactions; however, to avoid confusion with the term co-information, we instead refer to them simply as dual interactions. Dual interactions are interactions that are conditioned on certain variables being 1 instead of 0. This makes them no longer equal to the Ising interactions between Boolean variables; however, there are situations in which an interaction is more interesting in the context of Z = 1 instead of Z = 0, for example, if Z is always 1 in the data under consideration.
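The claim that the dual interaction of X equals the 2-point interaction I_YZ conditioned on X = 1 can be confirmed numerically on a random positive distribution; a minimal sketch:

```python
import math
import random
from itertools import product

random.seed(1)
w = {s: random.random() + 0.1 for s in product((0, 1), repeat=3)}
Z = sum(w.values())
p = {s: v / Z for s, v in w.items()}  # random positive p(X, Y, Z)

# dual interaction of X: alternating sum of surprisals over the supersets of {X}
I_star_X = (math.log(p[(1, 1, 1)]) + math.log(p[(1, 0, 0)])
            - math.log(p[(1, 1, 0)]) - math.log(p[(1, 0, 1)]))

# the 2-point interaction I_YZ, conditioned on X = 1 instead of X = 0
pc = lambda y, z: p[(1, y, z)] / sum(p[(1, b, c)] for b, c in product((0, 1), repeat=2))
I_YZ_given_X1 = math.log(pc(1, 1) * pc(0, 0) / (pc(1, 0) * pc(0, 1)))

assert abs(I_star_X - I_YZ_given_X1) < 1e-12
print("dual interaction of X equals I_YZ conditioned on X = 1")
```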

• Mutual information is the Möbius inversion of marginal entropy on the lattice of subsets ordered by inclusion.
• Differential (or conditional) mutual information is the Möbius inversion of marginal entropy on the dual lattice.

To summarise these relationships diagrammatically, note that surprisals form a vector space, as follows. Let P(T) be the powerset of a set of variables T and let |P(T)| = n. This forms a lattice P = (P(T), ⊆) ordered by inclusion, meaning that P(T) can be assigned a topological ordering indexed by i as P(T) = ∪_{i=0}^n t_i. Let S be the set of linear combinations of surprisals of subsets of T:

S = { ∑_i α_i log p(t_i) | α_i ∈ R }.

This set is assigned a vector space structure over R by the usual scalar multiplication and addition. Note that the set B = {log p(t_i)} forms a basis for this vector space, because ∑_i α_i log p(t_i) = 0 has no non-trivial solutions, and span(B) = S. Only when two variables a and b are independent do we have linear dependencies in B, as it is then the case that log p(a, b) = log p(a) + log p(b). To define a map from S → R, we only need to specify its action on B and extend the definition linearly. This means that we can fully define the map eval_T : S → R by specifying

eval_T : log p(t_i) ↦ log p(t_i = 1, T \ t_i = 0).

Similarly, we can define the expectation map E : S → R as

E : log p(t_i) ↦ E_r[log p(t_i = r)] = −H(t_i),

which outputs the expected surprisal over all realisations R = r. Finally, note that the Möbius inversion over a poset P is an endomorphism of the set F_P of functions over P, defined as

µ : f(y) ↦ ∑_{x≤y} µ(x, y) f(x).

Together, these three maps ensure that the following diagram commutes: applying the Möbius inversion to surprisals and then evaluating yields the interactions, while taking expectations first and then applying the Möbius inversion yields the mutual information. For the case where T = {X, Y, Z} and R = {X, Y}, this explicitly amounts to the statement that the expected pointwise mutual information of X and Y equals their mutual information.
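At the two-variable level, the commutativity of the diagram amounts to the familiar fact that the expectation of the pointwise mutual information is the mutual information. A sketch with a toy distribution:

```python
import math

# toy joint distribution over (X, Y): correlated noisy bits
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {a: sum(v for (x, _), v in p.items() if x == a) for a in (0, 1)}
py = {b: sum(v for (_, y), v in p.items() if y == b) for b in (0, 1)}

def pmi(x, y):
    """Möbius inversion of surprisal at one realisation (pointwise MI, in bits)."""
    return math.log2(p[(x, y)] / (px[x] * py[y]))

# expectation after inversion ...
expected_pmi = sum(p[xy] * pmi(*xy) for xy in p)

# ... equals inversion after expectation (MI from marginal entropies)
H = lambda q: -sum(v * math.log2(v) for v in q.values() if v > 0)
mi = H(px) + H(py) - H(p)

assert abs(expected_pmi - mi) < 1e-12
print("E[PMI] = MI")
```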

Results and Examples
While mutual information and model-free interactions are related, there are several important differences in terms of how they capture dependencies. Note, for example, that higher-order information quantities are not independent of the lower-order quantities. The mutual information of three variables is bounded by the pairwise quantities as follows:

MI(X; Y; Z) ≤ min{ MI(X; Y), MI(X; Z), MI(Y; Z) }.

This means that there are no systems with zero pairwise mutual information and positive higher-order information. This is not true for the interactions. For example, a distribution with a 3-point interaction and no pairwise interactions can trivially be constructed as the Boltzmann distribution

p(x, y, z) = (1/Z) exp(I_XYZ · xyz).

In fact, any positive discrete distribution can be written as a Boltzmann distribution with an energy function that is unique up to a constant, and as such is uniquely defined by its interactions; in other words, each interaction, at any order, can be freely varied to define a unique and valid probability distribution, namely, the Boltzmann distribution of the corresponding generalised Ising model. Note that this is closely related to the fact that a class of neural networks known as restricted Boltzmann machines (RBMs) are universal approximators [42][43][44] and exactly (though not uniquely) encode the Boltzmann distribution of a generalised Ising model in one of their layers [31,45]. Therefore, each distribution is uniquely determined by its set of interactions, and should be distinguishable by them. This is famously not true for entropy-based information quantities, as illustrated below through several examples.
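The construction above can be made explicit: a Boltzmann distribution with a single 3-body coupling and no fields or pairwise couplings has exactly one nonzero MFI. A sketch (the coupling value is arbitrary):

```python
import math
from itertools import combinations, product

# Boltzmann distribution of a generalised Ising model with a single
# 3-body coupling J and no fields or pairwise couplings: p(x, y, z) ∝ exp(J·xyz)
J, n = 1.5, 3
E = {s: J * s[0] * s[1] * s[2] for s in product((0, 1), repeat=n)}
Z = sum(math.exp(e) for e in E.values())
p = {s: math.exp(e) / Z for s, e in E.items()}

def mfi(tau):
    """MFI in its Möbius-inversion form: alternating subset sum of surprisals."""
    return sum((-1) ** (len(tau) - r)
               * math.log(p[tuple(1 if i in a else 0 for i in range(n))])
               for r in range(len(tau) + 1) for a in combinations(tau, r))

print([round(mfi(t), 6) for t in [(0,), (0, 1), (0, 2), (1, 2)]])  # all vanish
print(round(mfi((0, 1, 2)), 6))  # the 3-point interaction recovers J = 1.5
```

Note that, consistent with the bound on mutual information above, the pairwise mutual information of this distribution is not zero even though all pairwise interactions are.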

Interactions and Their Duals Quantify and Distinguish Synergy in Logic Gates
Under the assumption of a causal collider structure A → C ← B, nonzero 3-point interactions I_ABC can be interpreted as logic gates. A positive 3-point interaction means that the numerator in Equation (10) is larger than the denominator. Under the sufficient (though not necessary) assumption that each term in the numerator is larger than each term in the denominator, the distribution concentrates, as I_ABC → +∞, on the states {(1,1,1), (1,0,0), (0,1,0), (0,0,1)}, which is the truth table of the XNOR gate, C = ¬(A ⊕ B). If we consider equally noisy gates, such that each gate produces its correct output with the same probability, the gates can be directly compared. Note that when a gate has a 3-point interaction I, its logical negation will have a 3-point interaction −I. This determines the 3-point interactions of all six non-trivial logic gates on two inputs, as summarised in Table 1. The two gates with the strongest absolute interactions, XNOR and XOR, are the only two gates that are purely synergistic, i.e., knowing only one of the two inputs provides no information about the output. This relationship to synergy holds for three-input gates as well. The three-input gate with the strongest 4-point interaction is the three-input XOR gate, D = (A + B + C) mod 2, which is again maximally synergistic, as observing only two of the three inputs provides zero bits of information on the output. Setting this maximum 4-point interaction to I, the three-input OR and AND gates receive a 4-point interaction I/4; thus, the hierarchies of interaction and synergy continue to match.
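The pattern of Table 1 can be reproduced with an explicit noise model. The model below (uniform inputs, output flipped with probability eps) is an assumption made for illustration, not necessarily the noise model behind the table:

```python
import math
from itertools import product

def gate_distribution(gate, eps=0.05):
    """Joint p(A, B, C): uniform inputs, output flipped with probability eps."""
    p = {}
    for a, b in product((0, 1), repeat=2):
        c = gate(a, b)
        p[(a, b, c)] = 0.25 * (1 - eps)
        p[(a, b, 1 - c)] = 0.25 * eps
    return p

def I3(p):
    """3-point MFI of Equation (10): odd-parity states over even-parity states."""
    num = p[(1, 1, 1)] * p[(1, 0, 0)] * p[(0, 1, 0)] * p[(0, 0, 1)]
    den = p[(1, 1, 0)] * p[(1, 0, 1)] * p[(0, 1, 1)] * p[(0, 0, 0)]
    return math.log(num / den)

gates = {
    "AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b, "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b), "XNOR": lambda a, b: 1 - (a ^ b),
}
I = {name: I3(gate_distribution(g)) for name, g in gates.items()}
for name, val in sorted(I.items(), key=lambda kv: -kv[1]):
    print(f"{name:5s} {val:+.3f}")
```

With this model, XNOR and XOR receive the largest interactions in absolute value, AND pairs with NOR, and OR pairs with NAND, matching the degeneracies discussed below.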
The 3-point interactions are able to separate most two-input logic gates by sign or value, leaving only AND∼NOR and OR∼NAND. Mutual information has less resolving power. Assuming a uniform distribution over all four allowed states from a gate's truth table, a brief calculation yields the values summarised in Table 2. That is, higher-order mutual information resolves strictly fewer logic gates by value and none by sign. In fact, the higher-order mutual information of a logic gate can never be positive, because it is bounded from above by the minimum of the pairwise mutual information, which is always zero for the pair of inputs. Because all entropy-based quantities inherit the degeneracy summarised in Table 2, neither the mutual information nor its dual can increase the resolving power (see Table 3). The logic gate interactions and their duals are summarised in Table 3, where it can be seen that neither I*_C^G = I_ABC^G + I_AB^G nor I*_A^G improves the resolution beyond that of the 3-point interaction. However, the 3-point interaction requires 2^3 = 8 probabilities to achieve this resolving power, whereas I*_C^G = log[ p_111 p_001 / (p_101 p_011) ] achieves the same resolving power with just four probabilities.
However, note that because of a difference in sign convention, dual mutual information is a difference of two mutual information quantities, while dual interactions are a sum of two interactions. Based on this, we can consider the difference of two interactions and define a new quantity J*_A^G = I_ABC^G − I_BC^G, which we refer to as a J-interaction. When the MFIs are interpreted in the context of an energy-based model, such as an Ising model or a restricted Boltzmann machine, the interactions have dimensions of energy, meaning that the J-interactions correspond to the difference in the energy contribution between a triplet and a pair. The J-interactions of the input nodes A and B assign a different value to each logic gate G, and the symmetric J-interaction J*^G = J*_A^G J*_B^G J*_C^G, analogous to the symmetric deltas from [40], inherits the perfect resolution from J*_A^G. Note that while J*_A^G and J*_B^G both have perfect resolution, J*_C^G = I_ABC^G − I_AB^G does not improve the resolution beyond that of the 3-point interaction. This results from the fact that in logic gates we have I_ABC^G = −2 I_AB^G, meaning that I_ABC^G and I_AB^G contain the same information. To see this, note that the sum I_ABC^G + 2 I_AB^G can be grouped, using the symmetry of the logic gates in their inputs (∀i, j : p_ijk = p_jik), into bracketed terms of the form p_ij1 p_ij0. Each such product involves two contradictory output states, exactly one of which is consistent with the truth table of G, so it reduces to the same constant regardless of the truth table of G. Note that this pattern could already be observed in Table 3, though it was not yet explained. Thus, the J-interactions of the input nodes uniquely assign a value to each gate proportional to the synergy of its logic. The hierarchy is J*_XNOR > J*_NOR > J*_AND, which is mirrored for the respective logical complements.
XNOR is indeed the most synergistic, while NOR is more synergistic than AND with respect to observing a 0 in one of the inputs; in a NOR gate, a 0 in the input provides no information on the output, while it completely fixes the output of an AND gate. Because the interactions are defined in a context of 0s, they order the synergy accordingly.

Interactions Distinguish Dynamics and Causal Structures
To illustrate how different association metrics reflect the underlying causal dynamics, consider data generated from a selection of three-node causal DAGs as follows. On a given DAG G, first denote the set of nodes without parents, the orphan nodes, by S_0. Each orphan node in S_0 receives a random value drawn from a Bernoulli distribution, i.e., P(X = 1) = p and P(X = 0) = 1 − p. Next, denote the set of children of orphan nodes as S_1. Each node in S_1 is then set to either the product of its parent nodes (for multiplicative dynamics) or the mean of its parent nodes (for additive dynamics), plus some zero-mean Gaussian noise with variance σ². Note that for the fork and the chain this simply amounts to a noisy copying operation. All nodes are then rounded to 0 or 1. A set S_2 is then defined as the set of all children of nodes in S_1, and these receive values using the same dynamics as before. As long as the causal structure is acyclic, this algorithm terminates on a set of nodes S_i that has no children. For example, the chain graph A → B → C has S_0 = {A}, S_1 = {B}, S_2 = {C}, and S_3 = ∅, at which point the updating terminates. Figure 4 shows the results for four different DAGs with multiplicative and additive dynamics (though these coincide for forks and chains). The six resulting dynamics are represented by only four distinct DAGs, two distinct (Pearson) correlation structures, four distinct partial-correlation structures, and two distinct mutual information structures, which means that each of these descriptions is degenerate for some of the dynamics. The shown pairwise partial correlations are the correlations among the residuals after a linear regression against the third variable. Because this is similar to conditioning on the third variable, it is somewhat analogous to the MFIs; in fact, when the variables are multivariate normal, the partial correlations are encoded in the inverse covariance matrix and are equivalent to pairwise Ising interactions [31].
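The generative procedure described above can be sketched as follows. Function and parameter names are illustrative, and sigma is here the noise standard deviation (the text specifies the variance σ²):

```python
import math
import random

def simulate(dag, n_samples=10000, p=0.5, sigma=0.3, mode="multiplicative", seed=0):
    """Generate binary samples from a DAG given as {node: [parent nodes]}.
    Orphan nodes draw from Bernoulli(p); every other node gets the product
    (multiplicative) or mean (additive) of its parents plus Gaussian noise
    with standard deviation sigma, rounded and clipped to {0, 1}."""
    rng = random.Random(seed)
    order, placed = [], set()
    while len(order) < len(dag):  # simple topological sort (dag must be acyclic)
        for node, parents in dag.items():
            if node not in placed and all(q in placed for q in parents):
                order.append(node)
                placed.add(node)
    data = []
    for _ in range(n_samples):
        x = {}
        for node in order:
            parents = dag[node]
            if not parents:
                x[node] = 1 if rng.random() < p else 0
            else:
                vals = [x[q] for q in parents]
                base = math.prod(vals) if mode == "multiplicative" else sum(vals) / len(vals)
                x[node] = min(1, max(0, round(base + rng.gauss(0, sigma))))
        data.append(x)
    return data

# collider A → C ← B with multiplicative dynamics
samples = simulate({"A": [], "B": [], "C": ["A", "B"]})
print(len(samples), samples[0])
```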
Indeed, it can be seen that the partial correlations are somewhat able to disentangle direct effects from indirect effects, although they fail to distinguish additive from multiplicative dynamics. Note that only the sign of the association and its significance are represented, as the precise value depends on the noise level σ². The rightmost column shows that the MFIs assign a unique association structure to each of the dynamics, distinguish between direct and indirect effects, and reveal multiplicative dynamics as a 3-point interaction while identifying additive dynamics as a purely pairwise process. Finally, note that both the partial correlation and the MFIs assign a negative association to the parent nodes in a collider structure. This reflects the fact that two nodes become dependent when conditioned on a common effect (cf. Berkson's paradox), a phenomenon already found in partial correlations of metabolomic data in [46]. The mutual information is affected by Berkson's paradox as well, as revealed through the negative three-point mutual information. This negative three-point mutual information is a direct effect of conditioning on the common effect C, which induces a dependence between the otherwise independent parents A and B.

Higher-Order Categorical Interactions Distinguish Dyadic and Triadic Distributions
That the interactions have such resolving power over distributions of binary variables is perhaps not very surprising in light of the universality of RBMs with respect to this class of distributions. More surprisingly, their resolving power extends to the case of categorical variables. In [47], the authors introduced two distributions, the dyadic and triadic distributions, which are indistinguishable by almost all commonly used information measures (i.e., Shannon, Renyi(2), residual, and Tsallis entropy, co-information, total correlation, CAEKL mutual information, interaction information, Wyner, exact, functional, and MSS common information, perplexity, disequilibrium, and LMRP and TSE complexities).
The two distributions are defined on three variables, each taking a value in a four-letter alphabet {0, 1, 2, 3}. The joint probabilities are summarised in Table 4. To construct the distributions, each category is represented as a binary string ({0, 1, 2, 3} → {00, 01, 10, 11}), leading to new variables {X_0, X_1, Y_0, Y_1, Z_0, Z_1}. The dyadic distribution is constructed by linking these new variables with pairwise rules X_0 = Y_1, Y_0 = Z_1, Z_0 = X_1, while the triadic distribution is constructed with triplet rules X_0 + Y_0 + Z_0 = 0 mod 2 and X_1 = Y_1 = Z_1. The resulting binary strings are then reinterpreted as categorical variables to produce Table 4.
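The construction rules translate directly into code. The sketch below (function names are our own) enumerates the free bits, fixes the remaining bits by the pairwise or triplet rules, and reinterprets each bit pair as a categorical value; the resulting supports should reproduce Table 4.

```python
import itertools

def dyadic():
    """Joint distribution from the pairwise rules X0=Y1, Y0=Z1, Z0=X1."""
    states = {}
    for x0, y0, z0 in itertools.product([0, 1], repeat=3):
        # Each pairwise rule fixes one of the remaining bits.
        x1, y1, z1 = z0, x0, y0
        xyz = (2 * x0 + x1, 2 * y0 + y1, 2 * z0 + z1)
        states[xyz] = 1 / 8
    return states

def triadic():
    """Joint distribution from the rules X0+Y0+Z0 = 0 mod 2 and X1=Y1=Z1."""
    states = {}
    for x0, y0, a in itertools.product([0, 1], repeat=3):
        z0 = (x0 + y0) % 2  # parity constraint fixes the third bit
        xyz = (2 * x0 + a, 2 * y0 + a, 2 * z0 + a)
        states[xyz] = 1 / 8
    return states

dyadic_states, triadic_states = dyadic(), triadic()
```

Both distributions are uniform over eight legal states; for example, (3, 3, 3) is legal in the dyadic but not in the triadic distribution, where X_0 = Y_0 = Z_0 = 1 violates the parity rule.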
The authors of [47] found that no Shannon-like measure can distinguish between the two distributions, and argued that the partial information decomposition, which is different for the two distributions, is not a natural information measure, as it has to single out one of the variables as an output. To calculate model-free categorical interactions between the variables, we can set the probabilities of the states in Table 4 uniformly to p = (1 − (64 − 8)ε)/8 and those of the other states to ε (i.e., for ε → 0, a normalised uniform distribution over legal states). There are a total of 6^3 = 216 interactions I_XYZ(x_0 → x_1; y_0 → y_1; z_0 → z_1) such that x_1 > x_0, y_1 > y_0, z_1 > z_0. Each of these can be written as an alternating sum of log-probabilities over the eight corners of the corresponding transition cube,

I_XYZ(x_0 → x_1; y_0 → y_1; z_0 → z_1) = Σ_{i,j,k ∈ {0,1}} (−1)^{3−(i+j+k)} log p(X = x_i, Y = y_j, Z = z_k).

Of particular interest here are the two quantities I_XYZ(0 → 3; 0 → 3; 0 → 3) and the additively symmetrised sum Σ I_XYZ(x_0 → x_1; y_0 → y_1; z_0 → z_1), where the sum runs over all values such that x_1 > x_0, y_1 > y_0, z_1 > z_0; summing over all ordered pairs instead would necessarily give zero, because I_XYZ(x_0 → x_1; y_0 → y_1; z_0 → z_1) = −I_XYZ(x_1 → x_0; y_0 → y_1; z_0 → z_1). The single interaction I_XYZ(0 → 3; 0 → 3; 0 → 3) turns out to be zero for the dyadic distribution and negative for the triadic distribution. The sum over all 3-point interactions behaves the same way (see Appendix A.4 for details): the additively symmetrised 3-point interaction is zero for the dyadic distribution and strongly negative for the triadic distribution. These two distributions, which are indistinguishable in terms of their Shannon information structure, are thus distinguished by their model-free interactions, which accurately reflect the higher-order nature of the triadic distribution.
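Under the uniform-ε smoothing described above, both quantities can be computed by brute force. The sketch below hard-codes the eight legal states of each distribution (from the construction rules above) and takes the categorical 3-point interaction to be the alternating sum of log-probabilities over the eight corners of the transition cube; the helper names and the choice ε = 10⁻⁶ are our own.

```python
import itertools
import math

# Legal states (X, Y, Z) with X = 2*X0 + X1, etc., from the construction rules.
dyadic_states = {(0, 0, 0), (1, 0, 2), (0, 2, 1), (1, 2, 3),
                 (2, 1, 0), (3, 1, 2), (2, 3, 1), (3, 3, 3)}
triadic_states = {(0, 0, 0), (0, 2, 2), (2, 0, 2), (2, 2, 0),
                  (1, 1, 1), (1, 3, 3), (3, 1, 3), (3, 3, 1)}

def smooth(legal, eps=1e-6):
    """Assign (1 - 56*eps)/8 to the 8 legal states and eps to the other 56."""
    p_legal = (1 - 56 * eps) / 8
    return {s: (p_legal if s in legal else eps)
            for s in itertools.product(range(4), repeat=3)}

def interaction(p, xs, ys, zs):
    """Categorical 3-point interaction I_XYZ(x0 -> x1; y0 -> y1; z0 -> z1):
    an alternating sum of log-probabilities over the 8 corners of the cube."""
    total = 0.0
    for i, j, k in itertools.product((0, 1), repeat=3):
        total += (-1) ** (3 - (i + j + k)) * math.log(p[(xs[i], ys[j], zs[k])])
    return total

p_dy, p_tri = smooth(dyadic_states), smooth(triadic_states)
corner = lambda p: interaction(p, (0, 3), (0, 3), (0, 3))

# Additively symmetrised sum over all 6^3 = 216 increasing transitions.
pairs = [(a, b) for a in range(4) for b in range(4) if a < b]
total_sum = lambda p: sum(interaction(p, xs, ys, zs)
                          for xs in pairs for ys in pairs for zs in pairs)
```

For the dyadic distribution the legal and illegal corners balance exactly, so both quantities vanish; for the triadic distribution both are negative, diverging as ε → 0.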

Discussion
In this paper, we have related the model-free interactions introduced in [32] to information theory by defining them as Möbius inversions of surprisal on the same lattice that relates mutual information to entropy. We then invert the order of the lattice and compute the order-dual to the mutual information, which turns out to be a generalisation of differential mutual information. Similarly, the order-dual of interaction turns out to be interaction in a different context. Both the interactions and the dual interactions are able to distinguish all six logic gates by value and sign. Moreover, their absolute strength reflects the synergy within the logic gate. In simulations, the interactions were able to perfectly distinguish six kinds of causal dynamics that are partially indistinguishable to Pearson/partial correlations, causal graphs, and mutual information. Finally, we considered dyadic and triadic distributions constructed using pairwise and higher-order rules, respectively. While these two distributions are indistinguishable in terms of their Shannon information, they have different categorical MFIs that reflect the order of the construction rules.
One might wonder why the interactions enjoy this advantage over entropy-based quantities. The most obvious difference is that the interactions are defined in a pointwise way, i.e., in terms of the surprisal of particular states, whereas entropy is the expected surprisal across an ensemble of states. Furthermore, the MFIs can be interpreted as interactions in an Ising model and as effective couplings in a restricted Boltzmann machine. As both these models are known to be universal approximators with respect to positive discrete probability distributions, the MFIs should be able to characterise all such distributions. What is not immediately obvious is that the kinds of interactions that characterise a distribution should reflect properties of that distribution, such as the difference between direct and indirect effects and the presence of higher-order structure. However, in the various examples covered in this manuscript the interactions turn out to intuitively align with properties of the process used to generate the data. While the stringent conditioning on variables not considered in the interaction might make it tempting to interpret an MFI as a causal or interventional quantity, it is important to be very careful when doing this. Assigning a causal interpretation to statistical inferences, whether in Pearl's graphical do-calculus [49] or in Rubin's potential outcomes framework [50], requires further (often untestable) assumptions and analysis of the system in order to determine whether a causal effect is identifiable and which variables to control for. In contrast, an MFI is simply defined by conditioning on all observed variables, makes no reference to interventions or counterfactuals, and does not specify a direction of the effect. While in a controlled and simple setting the MFIs can be expressed in terms of causal average treatment effects [32], a causal interpretation is not justifiable in general.
Moreover, the stringency in the conditioning might worry the attentive reader. Estimating log p(X = 1, Y = 1, T = 0) directly from data means counting states such as (X, Y, T 1 , T 2 , . . . , T N ) = (1, 1, 0, 0, . . . 0), which for sufficiently large N are rare in most datasets. Appendix A.1 shows how to use the causal graph to construct Markov blankets, making such estimation tractable when full conditioning is too stringent. In an upcoming paper, we address this issue by estimating the graph of conditional dependencies, allowing for successful calculation of MFIs up to the fifth order in gene expression data.
One major limitation of MFIs is that they are only defined on binary or categorical variables, whereas many other association metrics are defined for ordinal and continuous variables as well. As states of continuous variables no longer form a lattice, it is hard to see how the definition of MFIs could be extended to include these cases.
Finally, it is worth noting that the structure of different lattices has guided much of this research. That Boolean algebras are important in defining higher-order structure is not surprising, as they are the stage on which the inclusion-exclusion principle can be generalised [36]. However, it is not only their order-reversed duals that lead to meaningful definitions; completely unrelated lattices do as well. For example, the Möbius inversion on the lattice of ordinal variables from Figure 3 and the redundancy lattices in the partial information decomposition [28] both lead to new and sensible definitions of information-theoretic quantities. Furthermore, the notion of Möbius inversion has been generalised to a more general class of categories [51], of which posets are a special case. A systematic investigation of information-theoretic quantities in this richer context would be most interesting.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The funders had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

MI — mutual information; MFI — model-free interaction; RBM — restricted Boltzmann machine; DAG — directed acyclic graph; pmi — pointwise mutual information

Appendix A.1. Estimating Interactions from Data
Estimating the interaction in Definition 2 from data involves estimating the probabilities of certain states occurring. While we do not have access to the true probabilities, we can rewrite the interactions in terms of expectation values. Note that all interactions involve factors of the type

log p(X = x, Y = y, Z = 0) = log E[𝟙{X = x} 𝟙{Y = y} 𝟙{Z = 0}],

where Z collects all remaining variables. This allows us to write the 2-point interaction, e.g., as

I_XY = log ( E[𝟙{X=1} 𝟙{Y=1} 𝟙{Z=0}] E[𝟙{X=0} 𝟙{Y=0} 𝟙{Z=0}] / ( E[𝟙{X=1} 𝟙{Y=0} 𝟙{Z=0}] E[𝟙{X=0} 𝟙{Y=1} 𝟙{Z=0}] ) ).    (A5)

Although expectation values are theoretical quantities, not empirical ones, sample means can be used as unbiased estimators of each term in (A5). The stringent conditioning in this estimator can make the number of samples that satisfy the conditioning very small, which results in the estimates having a large variance across different finite samples. Note that if we can find a subset of variables MB_{X_i} such that X_i ⊥⊥ X_k | MB_{X_i} for all X_k ∉ MB_{X_i} with i ≠ k (in causal language, a set of variables MB_{X_i} that d-separates X_i from the rest), then we only have to condition on MB_{X_i} in (A5), reducing the variance of our estimator. Such a set MB_{X_i} is called a Markov blanket of the node X_i. There has recently been a certain degree of confusion around the notion of Markov blankets in biology, specifically with respect to their use in the free energy principle in neuroscience contexts. Here, a Markov blanket refers to the notion of a Pearl blanket in the language of [52]. Because conditioning on fewer variables should reduce the variance of the estimate by increasing the number of samples that can be used for the estimation, we are generally interested in finding the smallest Markov blanket. This minimal Markov blanket is called the Markov boundary.
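A minimal sketch of this estimator (hypothetical variable names; binary data as NumPy arrays, with the background variables stacked as the columns of `rest`):

```python
import numpy as np

rng = np.random.default_rng(1)

def two_point_mfi(X, Y, rest):
    """Estimate I_XY = log[ p(1,1,0) p(0,0,0) / (p(1,0,0) p(0,1,0)) ], where the
    last slot conditions all remaining variables to 0. Each probability is
    estimated by the sample mean of the corresponding indicator variable."""
    background = (rest == 0).all(axis=1)  # the stringent conditioning
    p = lambda x, y: np.mean((X == x) & (Y == y) & background)
    return np.log(p(1, 1) * p(0, 0) / (p(1, 0) * p(0, 1)))

# Toy data: one background variable T and two cases for Y.
n = 200_000
T = rng.binomial(1, 0.3, n)[:, None]
X = rng.binomial(1, 0.5, n)
Y_indep = rng.binomial(1, 0.5, n)            # independent of X
flip = rng.random(n) < 0.05
Y_dep = np.where(flip, 1 - X, X)             # mostly copies X

I_indep = two_point_mfi(X, Y_indep, T)       # should be close to 0
I_dep = two_point_mfi(X, Y_dep, T)           # strongly positive
```

With more background variables the indicator `background` quickly becomes rare, which is precisely the variance problem that conditioning only on a Markov blanket alleviates.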
Finding such minimal Markov blankets is hard; in fact, because it requires testing each possible conditional dependency between the variables, we claim here (without proof) that it is causal discovery-hard, i.e., if such a graph exists it is at least as computationally complex as constructing a causal DAG consistent with the joint probability distribution.

Appendix A.2. Proofs
Markov blankets are not only a computational trick; in theory, only variables that are in each other's Markov blanket can share a nonzero interaction. To illustrate this, first note that the property of being in a variable's Markov blanket is symmetric:

Proposition A1 (symmetry of Markov blankets). Let X be a set of variables with joint distribution p(X), and let A ∈ X and B ∈ X such that A ≠ B. We denote the minimal Markov blanket of a variable X by MB_X. Then, A ∈ MB_B ⇐⇒ B ∈ MB_A, and we say that A and B are Markov-connected.
Proof. Consider that A ∈ MB_B means that A and B are conditionally dependent given all remaining variables. Since conditional dependence is symmetric in its arguments, B is then conditionally dependent on A given the same set, which means that B ∈ MB_A. Because A ∈ MB_B ⇐⇒ B ∈ MB_A holds, its negation holds as well, which completes the proof.
This definition of Markov connectedness allows us to state the following.
Theorem A1 (only Markov-connected variables can interact). A model-free n-point interaction I_{1...n} can only be nonzero when all variables in S = {X_1, . . . , X_n} are mutually Markov-connected.
Proof. Let X be a set of variables with joint distribution p(X), let S = {X_1, . . . , X_n}, and let X̄ = X \ S. Consider the definition of an n-point interaction among S:

I_S = Σ_{V ⊆ S} (−1)^{|S|−|V|} log p(X_V = 1, X_{S\V} = 0 | X̄ = 0).

Pairing the terms in which X_n = 1 with those in which X_n = 0, this can be rewritten as

I_S = Σ_{V ⊆ S\{X_n}} (−1)^{|S|−1−|V|} [ log p(X_n = 1 | X_V = 1, X_{S\(V∪{X_n})} = 0, X̄ = 0) − log p(X_n = 0 | X_V = 1, X_{S\(V∪{X_n})} = 0, X̄ = 0) ].

Now, if ∃X_j ∈ S such that X_j ∉ MB_{X_n}, we do not need to condition on X_j, and the conditional probabilities of X_n no longer involve X_j. The terms in which X_j = 1 and X_j = 0 are then identical but carry opposite signs, so they cancel pairwise, which means that if any variable in S is not in the Markov blanket of X_n, then the interaction I_S vanishes. Furthermore, as the indexing we chose for our variables was arbitrary, this must hold for any re-indexing, which means that I_S can only be nonzero if every variable in S lies in the Markov blanket of every other. This in turn means that all variables in S must be Markov-connected in order for the interaction I_S to be nonzero.
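Theorem A1 can be checked numerically on a small example (the probabilities below are hypothetical): when X_3 is independent of (X_1, X_2), its Markov blanket excludes both, so every interaction involving X_3 vanishes, while the Markov-connected pair retains a nonzero 2-point interaction.

```python
import itertools
import math

# Joint distribution in which X3 is independent of the dependent pair (X1, X2).
p12 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p3 = {0: 0.7, 1: 0.3}
p = {(a, b, c): p12[a, b] * p3[c]
     for a, b, c in itertools.product((0, 1), repeat=3)}

# 3-point interaction: alternating sum of log-probabilities over the cube.
# The log p3 contributions cancel pairwise, as in the proof of Theorem A1.
I_123 = sum((-1) ** (3 - (a + b + c)) * math.log(p[a, b, c])
            for a, b, c in itertools.product((0, 1), repeat=3))

# 2-point interaction of the Markov-connected pair X1, X2 (X3 held at 0).
I_12 = math.log(p[1, 1, 0] * p[0, 0, 0] / (p[1, 0, 0] * p[0, 1, 0]))
```

Here I_123 is exactly zero, while I_12 = log 6 ≠ 0 reflects the dependence within the pair.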
Thus, knowledge of the causal graph aids estimation in two ways: it shrinks the variance of the estimates by relaxing the conditioning, and it identifies the interactions that could be nonzero.
When knowledge of the causal graph is imperfect, it is possible to accidentally exclude a variable from a Markov blanket and thereby undercondition the relevant probabilities. The resulting error can be expressed in terms of the mutual information between the variables, as follows.
Proposition A2 (underconditioning bias). Let S be a set of random variables with probability distribution p(S), and let X, Y, and Z be three disjoint subsets of S. Then, omitting Y from the conditioning set results in a bias determined by (and linear in) the pointwise mutual information that Y = y provides about the states of X:

I_X(x_0 → x_1 | Z = z) = I_X(x_0 → x_1 | Y = y, Z = z) − [pmi(x_1; y | z) − pmi(x_0; y | z)].

Proof. The pointwise mutual information (pmi) is defined as

pmi(x; y | z) = log [ p(x | y, z) / p(x | z) ].

Note that p(x | y, z) = p(x | z) e^{pmi(x; y | z)}, meaning that we can write

log p(X = x | Z = z) = log p(X = x | Y = y, Z = z) − pmi(x; y | z).

That is, not conditioning on Y = y results in an error in the estimate of p(X = x_1 | Y = y, Z = z) that is exponential in the Z-conditional pmi of X and Y. However, consider the interaction among X: since an interaction is a difference of such log-probabilities, the exponential errors become linear,

I_X(x_0 → x_1 | Z = z) = log p(x_1 | z) − log p(x_0 | z) = I_X(x_0 → x_1 | Y = y, Z = z) − [pmi(x_1; y | z) − pmi(x_0; y | z)].

That is, the error in the interaction as a result of not conditioning on the right variables is linear in the difference between the pmi values of different states.
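A small numerical check of this relation (the joint distribution is hypothetical, and the set Z is taken to be empty for brevity):

```python
import math

# Hypothetical joint distribution p(X, Y) over two dependent binary variables.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
pX = {x: p[x, 0] + p[x, 1] for x in (0, 1)}
pY = {y: p[0, y] + p[1, y] for y in (0, 1)}

def pmi(x, y):
    """Pointwise mutual information pmi(x; y) = log p(x | y) - log p(x)."""
    return math.log(p[x, y] / pY[y]) - math.log(pX[x])

# 1-point interaction of X correctly conditioned on the background Y = 0 ...
I_true = math.log(p[1, 0] / pY[0]) - math.log(p[0, 0] / pY[0])
# ... versus the underconditioned version that omits Y entirely.
I_under = math.log(pX[1]) - math.log(pX[0])
# Proposition A2: the bias is the difference of two pmi values.
bias = pmi(1, 0) - pmi(0, 0)
```

The identity I_true = I_under + bias holds exactly, with the bias linear in the pmi difference rather than exponential in the pmi itself.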

Appendix A.3. Numerics of Causal Structures
Tables A1-A6 are taken from [48] with permission from the author, and list the precise values leading to Figure 4. From each graph, 100k samples were generated using p = 0.5 and σ = 0.4. To quantify the significance value of the interactions, the data were bootstrap resampled 1k times, resulting in the definition of F as the fraction of resampled interactions having a different sign from the original interaction. The smaller F is, the more significant the interaction.