Bivariate Partial Information Decomposition : The Optimization Perspective

Bertschinger, Rauh, Olbrich, Jost, and Ay (Entropy, 2014) have proposed a definition of a decomposition of the mutual information MI(X : Y, Z) into shared, synergistic, and unique information by way of solving a convex optimization problem. In this paper, we discuss the solution of their Convex Program from theoretical and practical points of view.


Introduction
Bertschinger, Rauh, Olbrich, Jost, and Ay [1] have proposed to compute a decomposition of the mutual information MI(X : Y, Z) into shared, synergistic, and unique contributions by way of solving a convex optimization problem.It is important to mention that William and Beer in [2] were the first to propose a measure for information decomposition.That measure suffered from serious flaws, which prompted a series of other papers [3][4][5]  where MI denotes mutual information, and (x, y, z) ∼ q stands for (x, y, z) is distributed according to the probability distribution q.The minimum ranges over all probability distributions q ∈ [0, 1] X×Y×Z which satisfy the marginal equations q x,y, * = b y x,y for all (x, y) ∈ X × Y q x, * ,z = b z x,z for all (x, z) ∈ X × Z.
We use the notational convention that an asterisk " * " is to be read as "sum over everything that can be plugged in here"; e.g., q x,y, * := ∑ w q x,y,w .
(We don't use the symbol * in any other context).
As pointed out in [1], the function q → MI (x,y,z)∼q (x : y, z) is convex and smooth in the interior of the convex region, and hence, by textbook results on convex optimization, for every ε > 0, an ε-approximation of the optimum-i.e., a probability distribution q for which MI (x,y,z)∼q (x : y, z) is at most ε larger than the minimum-can be found in at most O(α 2 /ε 2 ) rounds of an iterative algorithm (e.g., Proposition 7.3.1 in [6]), with a parameter α (the asphericity of the feasible region) depending on the marginals b y , b z .Each iteration involves evaluating the gradient of MI (x,y,z)∼q (x : y, z) with respect to q, and O(|X × Y × Z|) additional arithmetic operations.
In Appendix A of their paper, Bertschinger et al. discuss practical issues related to the solution of the convex optimization problem: They analyze the feasible region, solve by hand some of the problems, and complain that Mathematica could not solve the optimization problem out of the box.
Our paper is an expansion of that appendix in [1].Driven by the need in application areas (e.g., [7]) to have a robust, fast, out-of-the-box method for computing CI, we review the required convex optimization background; discuss different approaches to computing CI and the theoretical reasons contributing to the poor performance of some of them; and finally compare several of these approaches on practical problem instance.In conclusion, we offer two different practical ways of computing CI.
The convex optimization problem in [1] is the following: minimize MI (x,y,z)∼q (x : y, z) over q ∈ R X×Y×Z subject to q x,y, * = b y x,y for all (x, y) ∈ X × Y q x, * ,z = b z x,z for all (x, z) ∈ X × Z q x,y,z ≥ 0 for all (x, y, z) This is a Convex Program with a non-linear, convex continuous objective function q → MI (x,y,z)∼q (x : y, z), which is smooth in all points q with no zero entry.The feasible region (i.e., the set of all q satisfying the constraints) is a compact convex polyhedron, which is nonempty, as the distribution of the original random variables (X, Y, Z) is a feasible solution.Clearly, the condition that q should be a probability distribution is implied by it satisfying the marginal equations.
In the next section, we review some background about convex functions and convex optimization, in particular with respect to non-smooth functions.Section 3 aims to shed light on the theoretical properties of the convex program (CP).In Section 4, we present the computer experiments which we conducted and their results.Some result tables are in the Appendix A.

Convex Optimization Basics
Since the target audience of this paper is not the optimization community, for the sake of easy reference, we briefly review the relevant definitions and facts, tailored for our problem: the feasible region is polyhedral, the objective function is (convex and continuous but) not everywhere smooth.
Let f be a convex function defined on a convex set C ⊆ R n ; let x ∈ C and d ∈ R n .The directional derivative of f at x in the direction of d is defined as the limit Remark 1 ([8], Lemma 2.71).The limit in (1) exists in [−∞, ∞], by the convexity of f .
There can be many subgradients of f at a point.However: is the only subgradient of f at x.
We state the following lemma for convenience (it will be used in the proof of Proposition 2).It is a simplification of the Moreau-Rockafellar Theorem ([8], Theorem 2.85).

Lemma 2. Let
If, for i = 1, . . ., k, g i is a subgradient of f i at x i , then g := (g 1 , . . ., g k ) is a subgradient of f at x.Moreover, all subgradients of f at x are of this form.
For the remainder of this section, we reduce the consideration to convex sets C of the form given in the optimization problem (CP).Let A be an m × n matrix, and b an m-vector.Assume that Clearly, if a feasible descent direction of f at x exists, then x is not a minimum of f over C. The following theorem is a direct consequence of ( [8], Theorem 3.34) adapted to our situation.
Theorem 1 (Karush-Kuhn-Tucker). Suppose that for every j = 1, . . ., n, there is an x ∈ C with x j > 0. The function f attains a minimum over C at a point x ∈ C if, and only if, there exist

•
a subgradient g of f at x • and an m-vector λ ∈ R m such that A λ ≤ g, and for all j ∈ {1, . . ., n} with x j > 0, equality holds: (A λ) j = g j .
The condition "x j = 0 or (A λ) j = g j " is called complementarity.

Convex Optimization Through Interior Point Methods
For reasons which will become clear in Section 3, it is reasonable to try to solve (CP) using a so-called Interior Point or Barrier Method (we gloss over the (subtle) distinction between IPMs and BMs).The basic idea of these iterative methods is that all iterates are in the "interior" of something.
We sketch what goes on.
From a theoretical point of view, a closed convex set C is given in terms of a barrier function, i.e., a convex function F : C → R with F(x) → ∞ whenever x approaches the boundary of C. The prototypical example is C := [0, ∞] n , and F(x) := − ∑ j ln x j .The goal is to minimize a linear objective function x → ∑ j c j x j over C.
To fit problems with nonlinear (convex) objective f into this paradigm, the "epigraph form" is used, which means that a new variable s is added along with a constraint f (q) ≤ s, and the objective is "minimize s".
The algorithm maintains a barrier parameter t > 0, which is increased gradually.In iteration k, it solves the unconstrained optimization problem The barrier parameter is then increased, and the algorithm proceeds with the next iteration.The fact that the barrier function tends to infinity towards the boundary of C makes sure that the iterates x (k) stay in the interior of C for all k.
The unconstrained optimization problem (2) is solved using Newton's method, which is itself iterative.If x is the current iterate, Newton's method finds the minimum to the second order Taylor expansion about x of the objective function g : x → t k • c x + F(x): where Hg(x) denotes the Hessian of g at x. Then Newton's method updates x (e.g., by simply replacing it with the argmin).Note that Hg(x) = HF(x), so the convergence properties (as well as, whether it works at all or not), depend on the properties of the barrier function.Suitable barrier functions are known to exist for all compact convex sets [9,10].However, for a given set, finding one which can be quickly evaluated (along with the gradient and the Hessian) is sometimes a challenge.
A more "concrete" point of view is the following.Consider f and C as above: C := {x | Ax = b, x ≥ 0} and f is convex and continuous on C, and smooth in the interior of C (here: the points x ∈ C with x j > 0 for all j) which we assume to be non-empty.A simple barrier-type algorithm is the following.In iteration k, solve the equality-constrained optimization problem The equality-constrained problem is solved by a variant of Newton's method: If x is the current iterate, Newton's method finds the minimum to the second order Taylor expansion about x of the function g : x → t k f (x) − ∑ ln(x j ) subject to the equations Ax = b, using Lagrange multipliers, i.e., it solves the linear system , where Diag(•) denotes a diagonal matrix of appropriate size with the given diagonal.
By convexity of f , H f (x) is positive semidefinite, so that adding the diagonal matrix results in a (strictly) positive definite matrix.Hence, the system of linear equations always has a solution.
The convergence properties of this simple barrier method now depend on properties of f .We refer the interested reader to ( [11], Chapter 11) for the details.
Generally speaking, today's practical Interior Point Methods are "Primal-Dual Interior Point Methods": They solve the "primal" optimization problem and the dual-the system in Theorem 1-simultaneously.After (successful) termination, they return not only an ε-approximation x of the optimum, but also the Lagrange multipliers λ which certify optimality.

Theoretical View on Bertschinger et al.'s Convex Program
We will use the following running example in this section.Among all q ∈ R X×Y×Z (16 dimensions) satisfying the 16 equations q x,y, * = b y x,y ∀x, y, q x, * ,z = b z x,z ∀x, z, and the 16 inequalities q x,y,z ≥ 0 for all x, y, z, we want to find one which minimizes MI (x,y,z)∼q (x : y, z).

The Feasible Region
In this section, we discuss the feasible region: ¶(b We will omit the "(b)" when no confusion can arise.We will always make the following assumptions: 1.
b y and b z are probability distributions.2.
¶(b) is non-empty.This is equivalent to b y x, * = b z x, * for all x ∈ X.

3.
No element of X is "redundant", i.e., for every x ∈ X we have both b y x, * > 0 and b z x, * > 0. First of all, recall the vectors d ∈ R X×Y×Z from Appendix A.1 of [1], defined by dx,y,z,y ,z := 1 x,y,z + 1 x,y ,z − 1 x,y,z − 1 x,y ,z , ( (where 1 ... is the vector with exactly one non-zero entry in the given position, and 0 otherwise) satisfy dx,y, * = dx, * ,z = 0 for all x, y, z (we omit the superscripts for convenience, when possible).
Our first proposition identifies the triples (x, y, z) for which the equation q x,y,z = 0 holds for all q ∈ ¶(b).Proposition 1.If q x,y,z = 0 holds for all q ∈ ¶(b), then b y x,y = 0 or b z x,z = 0.
Proof.Let x, y, z ∈ X × Y × Z, and assume that b y x,y = 0 and b z x,z = 0. Take any p ∈ ¶(b) with p x,y,z = 0-if no such p exists, we are done.Since the marginals are not zero, there exist y ∈ Y \ {y} and z ∈ Z \ {z} with p x,y ,z > 0 and p x,y,z > 0. Let d := dx,y,z,y ,z .By the remarks preceding the proposition, p + δ d satisfies the marginal equations for all δ ∈ R. Since p x,y ,z > 0 and p x,y,z > 0, there exists a δ > 0 such that q := p + δ d ≥ 0. Hence, we have q ∈ ¶(b) and q x,y,z > 0, proving that the equation q x,y,z = 0 is not satisfied by all elements of ¶(b).
The proposition allows us to restrict the set of variables of the Convex Program to those triples (x, y, z) for which a feasible q exists with q x,y,z = 0 (thereby avoiding unneccessary complications in the computation of the objective function and derivatives; see the next section); the equations with RHS 0 become superfluous.We let denote the set of remaining triplets.We will omit the "(b)" when no confusion can arise.
x,y > 0} is non-empty, by assumption 3; the same is true for Lemmas 26 and 27 of [1] now basically complete the proof of our corollary: It implies that the dimension is equal to As J (b) is the disjoint union of the sets {x} × Y x × Z x , x ∈ X, we find this quantity to be equal to which concludes the proof of the corollary.

The Objective Function and Optimality
We now discuss the objective function.The goal is to minimize Since the distribution of x is fixed by the marginal equations, the first term in the sum is a constant, and we are left with minimizing negative conditional entropy.We start by discussing negative conditional entropy as a function on its full domain (We don't assume q to be a probability distrbution, let alone to satisfy the marginal equations): where we set 0 ln(. . . ) := 0, as usual.The function f is continuous on its domain, and it is smooth on [0, ∞] J .Indeed, we have the gradient ∇ f (q) x,y,z and the Hessian It is worth pointing out that the Hessian, while positive semidefinite, is not positive definite, and, more pertinently, it is not in general positive definite on the tangent space of the feasible region, i.e., r H f (q)r = 0 is possible for r ∈ R J with r(x, y, * ) = r(x, * , z) = 0 for all (x, y, z).Indeed, if, e.g., J = X × Y × Z, it is easy to see the kernel of H f (q) has dimension |Y × Z|, whereas the feasible region has dimension then for every q(!) the kernel of H f (q) the must have a non-empty intersection with the tangent space of the feasible region.
Optimality condition and boundary issues.In the case of points q which lie on the boundary of the domain, i.e., q x,y,z = 0 for at least one triplet (x, y, z), some partial derivatives don't exist.For y, z, denote X yz := {x | (x, y, z) ∈ J }.The situation is as follows.
If there is a (x, y, z) ∈ J with q x,y,z = 0 but q * ,y,z > 0, then f does not have a subgradient at q. Indeed, there is a feasible descent direction of f at q with directional derivative −∞.

(b)
Otherwise-i.e., for all (x, y, z) ∈ J , q x,y,z = 0 only if q * ,y,z = 0-subgradients exist.For all y, z, let y,z ∈ [0, 1] X yz be a probability distribution on X yz .Suppose that, for all y, z with q * ,y,z > 0, y,z x = q x,y,z q * ,y,z for all x ∈ X yz .
Then g defined by g x,y,z := ln( y,z x ), for all (x, y, z) ∈ J , is a subgradient of f at q.Moreover, g is a subgradient iff there exists such a g with • g x,y,z ≤ g for all (x, y, z) ∈ J with q x,y,z = 0; • g x,y,z = g for all (x, y, z) ∈ J with q x,y,z > 0.
Proof.For Proposition 2, let (y, z) ∈ Y × Z with q * ,y,z > 0, and x ∈ X yz with q x,y,z = 0.There exist y , z such that q x,y ,z , q x,y,z > 0. This means that d := dx,y,z,y ,z as defined in (4) is a feasible direction.Direct calculations (written down in [12]) show that f (q; d) = −∞.Invoking Lemma 1 yields non-existence of the subgradient.As to Proposition 2, we prove the statement for every pair (y, z) ∈ Y × Z for the function and then use Lemma 2.
Let us fix one pair (y, z).If q x,y,z > 0 for all x ∈ X yz holds, then we f y,z is differentiable at q, so we simply apply Remark 2. Now assume q * ,y,z = 0.A vector g ∈ R X yz is a subgradient of f yz , iff ∑ x∈X yz r x,y,z ln(r x,y,z /r * ,y,z ) = f yz (q + r) holds for all r ∈ R J with r x,y,z ≥ for all (x, y, z) ∈ J , and r * > 0.
We immediately deduce g x ≤ 0 for all x.Let x := e g x for all x, and x := x /C, with C := * .Clearly, is a probability distribution on X yz .Moreover, the difference LHS-RHS of ( 10) is equal to D KL (r/r * ρ) + ln C, with D KL denoting Kullback-Leibler divergence.From the usual properties of the Kullback-Leibler divergence, we see that this expression is greater-than-or-equal to 0 for all r, if and only if C ≥ 1, which translates to From this, the statements in Proposition 2 follow.
From the proposition, we derive the following corollary.
Corollary 2. Suppose a minimum of f over ¶(b) is attained in a point q with q x,y,z = 0 for a triple (x, y, z) with b y x,y > 0 and b z x,z > 0. Then q u,y,z = 0 for all u ∈ X.
Proof.This follows immediately from the fact, expressed in item Proposition 2 of the lemma, that a negative feasible descent direction exists at a point q which with q x,y,z = 0 for a triple (x, y, z) ∈ J (b) with q * ,y,z > 0.
Based on Proposition 2 and the Karush-Kuhn-Tucker conditions, Theorem 1, we can now write down the optimality condition.Corollary 3. Let q ∈ ¶(b).The minimum of f over ¶(b) is attained in q if, and only if, (a) q * ,y,z = 0 holds whenever there is an (x, y, z) ∈ J (b) with q x,y,z = 0; and (b) there exist satisfying the following: For all y, z with q * ,y,z > 0, λ x,y + µ x,z = ln q x,y,z q * ,y,z holds for all x ∈ X yz ; for all y, z with q * ,y,z = 0 (but X yz = ∅), there is a probability distribution ∈ ]0, 1] X yz on X yz such that x ) holds for all x ∈ X yz .
From the conditions on the zero-nonzero pattern of an optimal solution, we can readily guess an optimal q.Let's guess wrongly, first, though: q 2,y,z = q 3,y,z = 1/16 for all y, z.In this case, all of the constraints in (11) become equations, and the 's don't occur.We have a system of equations in the variables λ •,• and µ •,• .It is easy to check that the system does not have a solution.

Algorithmic Approaches
We now discuss several possibilities of solving the convex program (CP): Gradient descent and interior point methods; geometric programming; cone programming over the so-called exponential cone.We'll explain the terms in the subsections.

Gradient Descent
Proposition 2 and Corollary 2 together with the running example make clear that the boundary is "problematic".On the one hand, the optimal point can sometimes be on the boundary.(In fact, this is already the case for the AND-gate, as computed in Appendix A.2 of [1].)On the other hand, by Corollary 2, optimal boundary points lie in a lower dimensional subset inside the boundary (codimension ≈ |X|), and the optimal points on the boundary are "squeezed in" between boundary regions which are "infinitely strongly repellent" (which means to express that the feasible descent direction has directional derivative −∞).
From the perspective of choosing an algorithm, it is pertinent that subgradients do not exist everywhere on the boundary.This rules out the use of algorithms which rely on evaluating (sub-)gradients on the boundary, such as projected (sub-)gradient descent.(And also generic active set and sequential quadratic programming methods (We refer the interested reader to [13] for background on these optimization methods); the computational result in Section 4 illustrate that.
Due to the huge popularity of gradient descent, the authors felt under social pressure to present at least one version of it.Thus, we designed an ad-hoc quick-and-dirty gradient descent algorithm which does its best to avoid the pitfalls of the feasible region: it's boundary.We now describe this algorithm.
Denote by A the matrix representing the LHS of the equations in (CP); also reduce A by removing rows which are linear combinations of other rows.Now, multiplication by the matrix P := A (A A) −1 A amounts to projection onto the tangent space {d | Ad = 0} of the feasible region, and P∇ f (q) is the gradient of f in the tanget space, taken at the point q.
The strategy by which we try to avoid the dangers of approaching the boundary of the feasible region in the "wrong" way is by never reducing the smallest entry of the current iterate q by more then 10%.Here is the algorithm.Compute f (q) if f (q) better than all previous solutions then 6 store q

7
Compute the gradient ∇ f (q) 8 Compute the projection of the gradient g := P∇ f (q) 9 Determine a step size η, ensuring q xyz > ηg xyz for all xyz 10 Update q = q − ηg 11 until stopping criterion is reached There are lots of challenges with this approach, not the least of which is deciding on the step size η.Generally, a good step size for gradient descent is 1 over the largest eigenvalue of the Hessian-but the eigenvalues of the Hessian tend to infinity.
The stopping criterion is also not obvious: we use a combination of the norm of the projected gradient, the distance to the boundary, and a maximum of 1000 iterations.
None of these decisions are motivated by careful thought.

Interior Point Methods
Using Interior Point Methods (IPMs) appears to be the natural approach: While the iterates can converge to a point on the boundary, none of the iterates actually lie on the boundary, and that is an inherent property of the method (not a condition which you try to enforce artificially as in the gradient descent approach of the previous section).Consequently, problems with gradients, or even non-existing subgradients, never occur.
Even here, however, there are caveats involving the boundary.Let us consider, as an example, the simple barrier approach sketched in Section 2.1 (page 4).The analysis of this method (see [11]) requires that the function F : q → t f (q) − ∑ xyz ln(q xyz ) be self-concordant, which means that, for some constant C, for all q, h D where D 3 denotes the tensor of third derivatives.The following is proven in [12]: Proposition 3. Let n ≥ 2, and consider the function There is no C and no t such that (12) holds for all q ∈ [0, ∞] n and all h.
The proposition explains why, even for some IPMs, approaching the boundary can be problematic.We refer to [12] for more discussions about self-concordancy issues of the (CP).Proof.This follows immediately from the fact that a 3-self concordant barrier exists for (CP) (see [12]) and the complexity analysis of barrier method in [11], Chapter 11.
Still, we find the IPM approach (with the commercial Interior Point software "Mosek") to be the most usable of all the approaches which we have tried.

Geometric Programming
Geometric Programs form a sub-class of Convex Programs; they are considered to be more easily solvable than general Convex Programs: Specialized algorithms for solving Geometric Programs have been around for a half-century (or more).We refer to [14] for the definition and background on Geometric Programming.The Langrange dual of (CP) can be written as the following Geometric Program in the variables λ ∈ R X×Y , µ ∈ R X×Y (for simplicity, assume that all the right-hand-sides b are strictly positive): (GP)

Exponential Cone Programming
Cone Programming is a far-reaching generalization of Linear Programming, which may contain so-called generalized inequalities: For a fixed closed convex cone K, the generalized inequality "a ≤ K b" simply translates into b − a ∈ K.There is a duality theory similar to that for Linear Programming.
Efficient algorithms for Cone Programming exist for some closed convex cones; for example for the exponential cone [15], K exp , which is the closure of all triples (r, p, q) ∈ R 3 satisfying q > 0 and qe r/q ≤ p.
For q > 0, the condition on the right-hand side is equivalent to r ≤ q ln(p/q), from which it can be easily verified that the following "Exponential Cone Program" computes (CP).The variables are r, q, p ∈ R X×Y×Z .maximize ∑ x,y,z r x,y,z subject to q x,y, * = b y x,y for all (x, y) ∈ X × Y q x, * ,z = b z x,z for all (x, z) ∈ X × Z q * ,y,z − p x,y,z = 0 for all (x, y, z) ∈ X × Y × Z (r x,y,z , q x,y,z , p x,y,z ) ∈ K exp for all (x, y, z) ∈ X × Y × Z. (EXP) The first two constraints are just the marginal equations; the third type of equations connects the p-variables with the q-variables, and the generalized inequality connects these to the variables forming the objective function.
There are in fact several ways of modeling (CP) as an Exponential Cone Program; here we present the one which is most pleasant both theoretically (the duality theory is applicable) and in practice (it produces the best computational results).For the details, as well as for another model, we refer to [12].

Computational Results
In this section, we present the computational results obtained by solving Bertschinger et al.'s Convex Program (CP) on a large number of instances, using readily available software out-of-the-box.First we discuss the problem instances, then the convex optimization solvers, then discuss the results.Detailed tables are in the Appendix A.

Data
In all our instances, the vectors b y and b z are marginals computed form an input probability distribution p on X × Y × Z. Occasionally, "noise" (explained below) in the creation p leads to the phenomenon that the sum of all probabilities is not 1.This was not corrected before computing b y and b z , as it is irrelevant for the Convex Program (CP).
We have three types of instances, based on (1) "gates"-the "paradigmatic examples" listed in Table 1 of  (1) Gates.The instances of the type (1) are based on the "gates" (RDN, UNQ, XOR, AND, RDNXOR, RDNUNQXOR, XORAND) described Table 1 of [1]: Each gate gives rise to two sets of instances: (1a) the "unadulterated" probability distribution computed from the definition of the gate (see Table A1); (1b) empirical distributions generated by randomly sampling from W and computing (x, y, z).In creating the latter instances, we incorporate "noise" by perturbing the output randomly with a certain probability.We used 5 levels of noise, corresponding to increased probabilities of perturbing the output.For example, AND 0 refers to the empirical distribution of the AND gate without perturbation (probability = 0), AND 4 refers to the empirical distribution of the AND gate with output perturbed with probability 0.1.The perturbation probabilities are: 0, 0.001, 0.01, 0.05, 0.1.(See Tables A2 and A3).
(2) Example-31 instances.In Appendix A.2 of their paper, Bertschinger et al. discuss the following input probability distribution: X, Y, Z are independent uniformly random in {0, 1}.They present this as an example that the optimization problem can be "ill-conditioned".We have derived a large number of instances based on that idea, by taking (X, Y, Z) uniformly distributed on X × Y × Z, with |X| ranging in {2, . . ., 5} and |Y| = |Z| in {|X|, . . ., 5 • |X|}.These instances are referred to as "independent" in Table A4.We also created "noisy" versions of the probability distributions by perturbing the probabilities.In the results, we split the instances into two groups according to whether the fraction |Y|/|X| is at most 2 or greater than 2. (The rationale behind the choice of the sizes of |X|, |Y|, |Z| is the fact mentioned in Section 3.2, that the Hessian has a kernel in the tangent space if the ratio is high.)(3) Discretized gaussians.We wanted to have a type of instances which was radically different from the somewhat "combinatorial" instances ( 1) and ( 2).We generated (randomly) twenty 3 × 3 covariance matrices between standard Gaussian random variables X, Y, and Z.We then discretized the resulting continuous 3-dimensional Gaussian probability distribution by integrating numerically (We used the software Cuba [16] in version 4.2 for that) over boxes [0, 1] 3 , [0, 0.75] 3 , [0, 0.5] 3 , [0, 0.25] 3 ; and all of their translates which held probability mass at least 10 −20 .
We grouped these instances according to the number of variables, see Table A5: The instances in the "Gauss-1" group are the ones with at most 1000 variables; "Gauss-2" have between 1000 and 5000; "Gauss-3" have between 5000 and 250,000, "Gauss-4" between 25,000 and 75,000.

Convex Optimization Software We Used
For our computational results, we made an effort to use as many software toolboxes as we could get our hands on.(The differences in the results are indeed striking.)Most of our software was coded in the Julia programming language (version 0.6.0)using the "MathProgBase" package see [17] which is part of JuliaOpt.(The only the CVXOPT interior point algorithm and for the Geometric Program did we use Python.)On top of that, the following software was used.

•
CVXOPT [18] is written and maintained by Andersen, Dahl, and Vandenberghe.It transforms the general Convex Problems with nonlinear objective function into an epigraph form (see Section 2.1), before it deploys an Interior Point method.We used version 1.1.9.

•
Artelys Knitro [19] is an optimization suite which offers four algorithms for general Convex Programming, and we tested all of them on (CP).The software offers several different convex programming solvers: We refer by "Knitro_Ip" to their standard Interior Point Method [20]; by "Knitro_IpCG" to their IPM with conjugate gradient (uses projected cg to solve the linear system [21]); "Knitro_As" is their Active Set Method [22]; "Knitro_SQP" designates their Sequential Quadratic Programming Method [19].We used version 10.2.

•
Mosek [23] is an optimization suite which offers algorithms for a vast range of convex optimization problems.Their Interior Point Method is described in [24,25].We used version 8.0.

•
Ipopt [20] is a software for nonlinear optimization which can also deal with non-convex objectives.At its heart is an Interior Point Method (as the name indicates), which is enhanced by approaches to ensure convergence even in the non-convex case.We used version 3.0.• ECOS.We are aware of only two Conic Optimization software toolboxes which allow to solve Exponential Cone Programs: ECOS and SCS.ECOS is a lightweigt numerical software for solving convex cone programs [26], using an Interior Point approach.We used the version from Nov 8, 2016.• SCS [27,28] stands for Splitting Conic Solver.It is a numerical optimization package for solving large-scale convex cone problems.It is a first-order method, and generally very fast.We used version 1.2.7.

Results
We now discuss the computational results of every solver.The tables in Appendix A give three data points for each pair of instance (group) and solver: Whether the instance was solved, or, respectively, which fraction of the instances in the group were solved ("solved"); the time ("time") or, respectively, average time "avg.tm" needed (All solvers had a time limit 10,000.0s.We used a computer server with Intel(R) Core(TM) i7-4790K CPU (4 cores) and 16GB of RAM) to solve it/them.The third we call the "optimality measure": When a solver reports having found (optimal) primal and dual solutions, we computed:

•
the maximum amount by which any of the marginal equations is violated; • the maximum amount by which any of the nonnegativity inequaities is violated; • the maximum amount by which the inequality "A λ ≤ g" in Theorem 1 is violated; • the maximum amount by which the complementarity x j A λ) j − g j is violated.
Of all these maxima we took that maximum to have an indication of how reliable the returned solution is, with respect to feasibility and optimality.In the tables, the maximum is then taken over all the instances in the group represented in each row of the table.We chose this "worst-case" approach to emphasize the need for a reliable software.

Gradient Descent
The Gradient Descent algorithm sketched in the previous section finds its limits when the matrix P becomes too large to fit into memory, which happens for the larger ones of the Gaussian instances.
Even before that, the weakness of the algorithm are clear in the fact that the best solution it finds is often not optimal (the other methods produce better solutions).The running times are mid-field compared to the other methods.
It is probable that a more carefully crafted gradient descent would preform better, but the conceptual weaknesses of the approach remain.

Interior Point algorithms
Cvxopt It terminated with optimal solution on all the noiseless discrete gates except "RDN"see Tables A1-A3.But whenever the noise was added it started failing except on "XOR".So as the number of samples increased even when decreasing the noise it still failed.The optimality measure was always bad except on some of the noiseless gates.Mainly the dual solution was weakly feasible on the solved ones.It bailed out 56% due to KKT system problems and the rest was iteration and time limit.
For the uniformly binary gates again Cvxopt failed on the noisy distributions with 34% success on Noisy 1 and 0% on Noisy 2. It solved all the independent correctly with acceptable optimality measure.
For the Gaussian, it totally failed with 40% computational problems in solving the KKT system and 38% time limit and the rest were iteration limit.The solver was slowest compared to others and required high memory to store the model.
Knitro_Ip It terminated with optimal solution on all the discrete gates with good optimality measure see Tables A1-A4.But for the Gaussian, it failed most of the time.It solved only 5% of all Gaussian instances where 75% of the solved have less than 1000 variable Table A5.25% of the unsolved instances couldn't get a dual feasible solution.The rest stopped due to iteration or time limit but still they had either infeasible dual or the KKT conditions were violated.
Kintro_Ip was a little bit slower than Mosek on most of the discrete gates.On the Gaussian, it very slow.For most of the instances, Mosek had its optimality measure 1000 times better than Knitro_Ip.Even though it worked well for the discrete gates we can't rely on it for this problem since it is not stable with the variables number.
Knitro IpCG.It terminated with optimal solution on all the discrete gates except "XOR" and "RDNXOR" see Tables A2 and A3.For those two particular gates, it reached the iteration limit.For the Gaussian, it did a little better than Knitro_Ip since the projected conjugate gradient helps with large scale problems.It was able to solve only 6.2% none of which had less than 1000 variables and with 50% having more than 8000 variables see Table A5.25% of the unsolved instances couldn't get a dual feasible solution.The rest stopped due to iteration or time limit with similar optimality measure violation as Knitro_Ip.
Knitro_IpCG was considerably slower than Mosek and Knitro_Ip.And it had worse optimality measure than Mosek on most of its solved instances.Since Knitro_IpCG couldn't solve a big set of instances this solver is unreliable for this problem.
Mosek terminated with reporting an optimal solution almost on all instances.For the gates, with sufficient optimality measure see Tables A1-A4.For the Gaussian gates, it terminated optimally for 70% of the instances see Table A5.25% bailed out mainly since the gap between the primal and dual is negligible with the dual solution and primal are weakly feasible.Most of the latter instances had more than 10,000 variables.For the last 5% Mosek reached its iteration limit.
For all discrete gates Mosek terminated within milliseconds except "RDNUNQXOR" it terminated in at most 1 s.For the Gaussians only those with more than 25,000 variables took up to 5 s to be solved, the rest was done in less than a second.
Ipopt failed on the unadulterated "AND", "UNQ", and "XORAND" distributions since the optimal solution lays on the boundary and Ipopt hits the boundary quit often.Also, it couldn't solve any of the noiseless with sampling "AND" and "UNQ" gates, but managed to solve 30% of those of "XORAND".When the noise was applied, it worked 90% with some noisy "RDNXOR" gates, but bailed out on "RDNUNQXOR" gate with 10%, 5%, and 1%.Note that the optimality measure was highly violated even when Ipopt terminates optimally see Tables A1-A4 The violation mainly was in the dual feasibility and for the instances which had the solution on the boundary.
With the Gaussians, Ipopt couldn't solve any instance see Table A5.In 55% of the instances it terminated with computational errors or search direction being to small.25% terminated with iteration limit and the rest with time limit.Same as for discrete case the optimality measure was not good.For discrete gates Ipopt was as fast a Mosek.Overall Ipopt is an unreliable solver for this problem.

Geometric Programming: CVXOPT
CVXOPT has a dedicated Geometric Programming interface, which we used to solve the Geometric Program (GP).The approach cannot solve a single instance to optimality: CVXOPT always exceeds the limit on the number of iterations.Hence, we have not listed it in the tables.

Exponential Cone Programming: ECOS & SCS
SCS failed on each and every instance.For ECOS, however, on average the running times were fast, but the results were mildly good.For types 1 and 2 instances, ECOS was most of the time the fastest solver.It terminated optimally for all the instances of types 1 and 2.
ECOS terminated suboptimal on 90% of the Gaussian instances.On the rest of the Gaussian gates it ran into numerical problems, step size became negligible, and terminated with unfeasible solutions.
ECOS handled types 1 and 2 instances pretty well.It was mostly slow on the Gaussian instances.For example, some Gaussian instances took ECOS more than seven hours to terminate sub-optimally.Modeling the problem as an exponential cone programming seems to be a good choice.ECOS vigorous performance makes it a reliable candidate for this optimization problem.

Miscellaneous Approaches
We include the following in the computational results for completeness.From what we discussed in the previous section, it is no surprise at all that these methods perform badly.
Knitro_SQP failed on the unadulterated "AND" and "RDNUNQAND" distributions see Table A1.When the noise was added to the discrete gates it couldn't solve more than 35% per gate see Tables A2  and A3.It didn't have better optimality measure than Mosek except on "RDN 1" which is can't be built on since Knitro_SQP solved only 8% of those instances.Note that on the unsolved instances, Knitro_SQP was the only solver which gave wrong answers (30% of the unsolved) i.e., claimed that the problem is infeasible.
Similarly, on Gaussian it failed in 78% of the instances due to computational problems and the rest were iteration or time limit see Table A5.This confirms that such type of algorithms is unsuitable for our problem; see Section 3.
Knitro_As.It failed on all the instances, trying to evaluate the Hessian at boundary points.

Conclusions
A closer look at the Convex Program proposed by Bertschinger et al. [1] reveals some subtleties which makes the computation of the information decomposition challenging.We have tried to solve the convex program using a number of software packages out-of-the-box (including, BTW, [29]).
Two of the solvers, namely, Mosek and ECOS work very satisfactorily.Even though, ECOS sometimes was 1000 times faster than Mosek, on some of types 1 and 2 instances, the time difference was no more than 5 s making this advantage rather dispensable.Nevertheless, on these instances, Mosek had better optimality measures.Note that on the hard problems of type 1, namely, "RDNUNQXOR", Mosek was faster than ECOS see Figure 1.
On the other hand, ECOS was slower than Mosek on the Gaussian instances especially when the (CP) had a huge number of variables.For these instances, the time difference is significant (hours) see Figure 2, which is rather problematic for ECOS.
Hence each of the two solvers has its pros and cones.This means that using both solvers is an optimal strategy when approaching the problem.One suggestion would be to use ECOS as a surrogate when Mosek fails to give a solution.Earlier ad-hoc approaches, which were based on CVXOPT and on attempts to "repair" a situation when the solver failed to converge, appear to be redundant.
trying to improve these results.Denote by X the range of the random variable X, by Y the range of Y, and by Z the range of Z. Further, let b y x,y = P X = x ∧ Y = y for all x ∈ X, y ∈ Y b z x,z = P X = x ∧ Z = z for all x ∈ X, y ∈ Y the marginals.Then the Bertschinger et al. definition of synergistic information is CI(X : Y; Z) := MI(X : Y, Z) − min q MI (x,y,z)∼q (x : y, z),

Algorithm 1 :
Gradient Descent 1 Construct the matrix A 2 Compute P := A (A A) −1 A 3 Initialize q to a point in the interior of the feasible region 4 repeat 5

Corollary 4 .
The complexity of the interior point method via self-concordant barrier for solving (CP) is O(M • log 1 /ε), where M is the complexity of computing the Newton step (which can be done by computing and inverting the Hessian of the barrier) λ x,y + ∑ (x,z)∈X×Z b z x, z • µ x,z subject to ln ∑ x∈X exp λ xy + µ xz ≤ 0 for all (y, z) ∈ Y × Z.

Figure 1 .
Figure 1.Comparison Between the running times of Mosek and ECOS for problems whose number of variables is at below 10,500.

Figure 2 .
Figure 2. Comparison Between the running times of Mosek and ECOS for problems whose number of variables is at least 16,000.