
# Bivariate Partial Information Decomposition: The Optimization Perspective

Institute of Computer Science, University of Tartu, 51014 Tartu, Estonia
* Author to whom correspondence should be addressed.
Entropy 2017, 19(10), 530; https://doi.org/10.3390/e19100530
Received: 7 July 2017 / Revised: 21 September 2017 / Accepted: 28 September 2017 / Published: 7 October 2017

## Abstract

Bertschinger, Rauh, Olbrich, Jost, and Ay (Entropy, 2014) have proposed a definition of a decomposition of the mutual information $M I ( X : Y , Z )$ into shared, synergistic, and unique information by way of solving a convex optimization problem. In this paper, we discuss the solution of their Convex Program from theoretical and practical points of view.

## 1. Introduction

Bertschinger, Rauh, Olbrich, Jost, and Ay [1] have proposed to compute a decomposition of the mutual information $MI(X : Y,Z)$ into shared, synergistic, and unique contributions by way of solving a convex optimization problem. It is important to mention that Williams and Beer in [2] were the first to propose a measure for information decomposition. That measure suffered from serious flaws, which prompted a series of other papers [3,4,5] trying to improve these results. Denote by X the range of the random variable $X$, by Y the range of $Y$, and by Z the range of $Z$. Further, let
$b^y_{x,y} := P(X = x, Y = y)$, for $(x,y) \in X \times Y$, and $b^z_{x,z} := P(X = x, Z = z)$, for $(x,z) \in X \times Z$, denote the marginals. Then the Bertschinger et al. definition of synergistic information is
$\widetilde{CI}(X : Y; Z) := MI(X : Y,Z) - \min_q MI_{(x,y,z)\sim q}(x : y,z),$
where MI denotes mutual information, and $(x,y,z) \sim q$ stands for “$(x,y,z)$ is distributed according to the probability distribution q”. The minimum ranges over all probability distributions $q \in [0,1]^{X \times Y \times Z}$ which satisfy the marginal equations
$q_{x,y,*} = b^y_{x,y} \text{ for all } (x,y) \in X \times Y, \qquad q_{x,*,z} = b^z_{x,z} \text{ for all } (x,z) \in X \times Z.$
We use the notational convention that an asterisk “∗” is to be read as “sum over everything that can be plugged in here”; e.g.,
$q_{x,y,*} := \sum_{w} q_{x,y,w}.$
(We don’t use the symbol ∗ in any other context).
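For concreteness, the star-sum convention translates directly into code. The following Python sketch (the nested-list layout `p[x][y][z]`, the function names, and the AND-gate example distribution are our illustration, not from the paper) computes the marginal vectors $b^y$ and $b^z$:

```python
def marg_y(p):
    # b^y_{x,y} = p_{x,y,*}: sum out the z index
    return [[sum(row) for row in plane] for plane in p]

def marg_z(p):
    # b^z_{x,z} = p_{x,*,z}: sum out the y index
    return [[sum(row[z] for row in plane) for z in range(len(plane[0]))]
            for plane in p]

# Example: the AND gate (X = Y AND Z, with Y, Z independent uniform),
# stored as a nested list p[x][y][z].
p_and = [[[0.25, 0.25], [0.25, 0.00]],   # x = 0
         [[0.00, 0.00], [0.00, 0.25]]]   # x = 1
```

For `p_and`, both `marg_y(p_and)` and `marg_z(p_and)` return `[[0.5, 0.25], [0.0, 0.25]]`, i.e., $b^y_{0,0} = b^z_{0,0} = 1/2$.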
As pointed out in [1], the function $q \mapsto MI_{(x,y,z)\sim q}(x : y,z)$ is convex and smooth in the interior of the feasible region, and hence, by textbook results on convex optimization, for every $\varepsilon > 0$, an $\varepsilon$-approximation of the optimum—i.e., a probability distribution $q^\star$ for which $MI_{(x,y,z)\sim q^\star}(x : y,z)$ is at most $\varepsilon$ larger than the minimum—can be found in at most $O(\alpha^2/\varepsilon^2)$ rounds of an iterative algorithm (e.g., Proposition 7.3.1 in [6]), with a parameter $\alpha$ (the asphericity of the feasible region) depending on the marginals $b^y$, $b^z$. Each iteration involves evaluating the gradient of $MI_{(x,y,z)\sim q}(x : y,z)$ with respect to q, and additional arithmetic operations.
In Appendix A of their paper, Bertschinger et al. discuss practical issues related to the solution of the convex optimization problem: They analyze the feasible region, solve by hand some of the problems, and complain that Mathematica could not solve the optimization problem out of the box.
Our paper is an expansion of that appendix in [1]. Driven by the need in application areas (e.g., [7]) to have a robust, fast, out-of-the-box method for computing $\widetilde{CI}$, we review the required convex optimization background; discuss different approaches to computing $\widetilde{CI}$ and the theoretical reasons contributing to the poor performance of some of them; and finally compare several of these approaches on practical problem instances. In conclusion, we offer two different practical ways of computing $\widetilde{CI}$.
The convex optimization problem in [1] is the following:
$\begin{aligned} \text{minimize} \quad & MI_{(x,y,z)\sim q}(x : y,z) \quad \text{over } q \in R^{X \times Y \times Z} \\ \text{subject to} \quad & q_{x,y,*} = b^y_{x,y} \quad \text{for all } (x,y) \in X \times Y \\ & q_{x,*,z} = b^z_{x,z} \quad \text{for all } (x,z) \in X \times Z \\ & q_{x,y,z} \ge 0 \quad \text{for all } (x,y,z) \in X \times Y \times Z \end{aligned} \qquad \text{(CP)}$
This is a Convex Program with a non-linear, convex continuous objective function $q ↦ MI ( x , y , z ) ∼ q ( x : y , z )$, which is smooth in all points q with no zero entry. The feasible region (i.e., the set of all q satisfying the constraints) is a compact convex polyhedron, which is nonempty, as the distribution of the original random variables $( X , Y , Z )$ is a feasible solution. Clearly, the condition that q should be a probability distribution is implied by it satisfying the marginal equations.
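The objective is cheap to evaluate for a given q. A minimal Python sketch (the function name and nested-list layout `q[x][y][z]` are ours, assuming q is a proper joint distribution):

```python
from math import log

def mi_x_yz(q):
    """MI_{(x,y,z)~q}(x : (y,z)) in bits; q is a nested list q[x][y][z]."""
    nx, ny, nz = len(q), len(q[0]), len(q[0][0])
    px = [sum(q[x][y][z] for y in range(ny) for z in range(nz)) for x in range(nx)]
    pyz = [[sum(q[x][y][z] for x in range(nx)) for z in range(nz)] for y in range(ny)]
    return sum(q[x][y][z] * log(q[x][y][z] / (px[x] * pyz[y][z]), 2)
               for x in range(nx) for y in range(ny) for z in range(nz)
               if q[x][y][z] > 0)

# Xor gate: X = Y XOR Z with Y, Z independent uniform; here MI(x : y,z) = 1 bit.
q_xor = [[[0.25, 0.00], [0.00, 0.25]],
         [[0.00, 0.25], [0.25, 0.00]]]
```

The convention $0 \ln(\cdots) := 0$ is implemented by skipping zero entries.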
In the next section, we review some background about convex functions and convex optimization, in particular with respect to non-smooth functions. Section 3 aims to shed light on the theoretical properties of the convex program (CP). In Section 4, we present the computer experiments which we conducted and their results. Some result tables are in the Appendix A.

## 2. Convex Optimization Basics

Since the target audience of this paper is not the optimization community, for the sake of easy reference, we briefly review the relevant definitions and facts, tailored for our problem: the feasible region is polyhedral, the objective function is (convex and continuous but) not everywhere smooth.
Let f be a convex function defined on a convex set $C ⊆ R n$; let $x ∈ C$ and $d ∈ R n$. The directional derivative of f at x in the direction of d is defined as the limit
$f'(x; d) := \lim_{\varepsilon \searrow 0} \frac{f(x + \varepsilon d) - f(x)}{\varepsilon} \;\in\; [-\infty, \infty]. \qquad (1)$
Remark 1
([8], Lemma 2.71). The limit in (1) exists in $[ − ∞ , ∞ ]$, by the convexity of f.
A subgradient of f at a point $x \in C$ is a vector $g \in R^n$ such that, for all $y \in C$, we have $f(y) \ge f(x) + g^\top (y - x)$.
There can be many subgradients of f at a point. However:
Remark 2.
If f is differentiable at x, then $∇ f ( x )$ is the only subgradient of f at x.
Lemma 1
([8], Lemma 2.73). If g is a subgradient of f at $x \in C$, and $y \in C$, then $g^\top (y - x) \le f'(x; y - x)$.
We state the following lemma for convenience (it will be used in the proof of Proposition 2). It is a simplification of the Moreau-Rockafellar Theorem ([8], Theorem 2.85).
Lemma 2.
Let $C_i \subset R^{\ell}$, $i = 1, \dots, k$, be closed convex sets, $f_i \colon C_i \to R$ continuous convex functions, and
$f(x) = \sum_{i=1}^{k} f_i(x_i)$
for $x = (x_1, \dots, x_k) \in \prod_{i=1}^{k} C_i \subset R^{\ell k}$. If, for $i = 1, \dots, k$, $g_i$ is a subgradient of $f_i$ at $x_i$, then $g := (g_1, \dots, g_k)$ is a subgradient of f at x. Moreover, all subgradients of f at x are of this form.
For the remainder of this section, we reduce the consideration to convex sets C of the form given in the optimization problem (CP). Let A be an $m × n$ matrix, and b an m-vector. Assume that $C = { x ∈ R n ∣ A x = b ∧ x ≥ 0 }$. Suppose f is a convex function defined on C. Consider the convex optimization problem $min x ∈ C f ( x )$.
Let $x ∈ C$. A vector $d ∈ R n$ is called a feasible descent direction of f at x, if
(a)
for some $ε > 0$, $x + ε d ∈ C$ (“feasible direction”); and
(b)
$f ′ ( x ; d ) < 0$ (“descent direction”).
Clearly, if a feasible descent direction of f at x exists, then x is not a minimum of f over C. The following theorem is a direct consequence of ([8], Theorem 3.34) adapted to our situation.
Theorem 1 (Karush-Kuhn-Tucker).
Suppose that for every $j = 1 , ⋯ , n$, there is an $x ∈ C$ with $x j > 0$. The function f attains a minimum over C at a point $x ∈ C$ if, and only if, there exist
• a subgradient g of f at x
• and an m-vector $λ ∈ R m$
such that $A^\top \lambda \le g$, and for all $j \in \{1, \dots, n\}$ with $x_j > 0$, equality holds: $(A^\top \lambda)_j = g_j$.
The condition “$x j = 0$ or $( A ⏉ λ ) j = g j$” is called complementarity.

#### 2.1. Convex Optimization Through Interior Point Methods

For reasons which will become clear in Section 3, it is reasonable to try to solve (CP) using a so-called Interior Point or Barrier Method (we gloss over the (subtle) distinction between IPMs and BMs). The basic idea of these iterative methods is that all iterates are in the “interior” of something. We sketch what goes on.
From a theoretical point of view, a closed convex set C is given in terms of a barrier function, i.e., a convex function $F \colon C \to R$ with $F(x) \to \infty$ whenever x approaches the boundary of C. The prototypical example is $C = R^n_+$, with $F(x) := -\sum_j \ln x_j$. The goal is to minimize a linear objective function $x \mapsto \sum_j c_j x_j$ over C.
To fit problems with nonlinear (convex) objective f into this paradigm, the “epigraph form” is used, which means that a new variable s is added along with a constraint $f ( q ) ≤ s$, and the objective is “minimize s”.
The algorithm maintains a barrier parameter $t > 0$, which is increased gradually. In iteration k, it solves the unconstrained optimization problem
$x^{(k)} := \operatorname*{arg\,min}_x \; t_k \, c^\top x + F(x). \qquad (2)$
The barrier parameter is then increased, and the algorithm proceeds with the next iteration. The fact that the barrier function tends to infinity towards the boundary of C makes sure that the iterates $x ( k )$ stay in the interior of C for all k.
The unconstrained optimization problem (2) is solved using Newton’s method, which is itself iterative. If x is the current iterate, Newton’s method finds the minimum to the second order Taylor expansion about x of the objective function $g \colon x \mapsto t_k \cdot c^\top x + F(x)$:
$\min_y \; g(x) + \nabla g(x)^\top (y - x) + \tfrac{1}{2} (y - x)^\top H_g(x) (y - x),$
where $H_g(x)$ denotes the Hessian of g at x. Then Newton’s method updates x (e.g., by simply replacing it with the argmin).
Note that $H_g(x) = H_F(x)$, so the convergence properties (as well as whether it works at all or not) depend on the properties of the barrier function. Suitable barrier functions are known to exist for all compact convex sets [9,10]. However, for a given set, finding one which can be quickly evaluated (along with the gradient and the Hessian) is sometimes a challenge.
A more “concrete” point of view is the following. Consider f and C as above: $C := \{ x \mid Ax = b,\ x \ge 0 \}$ and f is convex and continuous on C, and smooth in the interior of C (here: the points $x \in C$ with $x_j > 0$ for all j), which we assume to be non-empty. A simple barrier-type algorithm is the following. In iteration k, solve the equality-constrained optimization problem
$\text{minimize} \; t_k f(x) - \sum_j \ln(x_j) \quad \text{subject to} \; Ax = b.$
The equality-constrained problem is solved by a variant of Newton’s method: If x is the current iterate, Newton’s method finds the minimum to the second order Taylor expansion about x of the function $g : x ↦ t k f ( x ) − ∑ ln ( x j )$ subject to the equations $A x = b$, using Lagrange multipliers, i.e., it solves the linear system
$H_g(x)\, y + A^\top \lambda = -\nabla g(x), \qquad A\, y = b,$
with $H_g(x) = t_k H_f(x) + \operatorname{Diag}_j(1/x_j^2)$ and $\nabla g(x) = t_k \nabla f(x) - (1/x_j)_j$, where $\operatorname{Diag}(\cdot)$ denotes a diagonal matrix of appropriate size with the given diagonal.
By convexity of f, $H f ( x )$ is positive semidefinite, so that adding the diagonal matrix results in a (strictly) positive definite matrix. Hence, the system of linear equations always has a solution.
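Assembled with numpy, one Newton step of this scheme looks as follows. This is a sketch under our own naming: for a feasible iterate $x > 0$ we solve the KKT system for the step d (so that $A d = 0$), which is equivalent to the system above:

```python
import numpy as np

def newton_kkt_step(x, t, grad_f, hess_f, A):
    """One Newton step for  min t*f(x) - sum_j log(x_j)  s.t.  A x = b,
    from a feasible iterate x > 0; returns the step d (with A d = 0)
    and the Lagrange multipliers."""
    n, m = len(x), A.shape[0]
    g = t * grad_f(x) - 1.0 / x                  # gradient of the barrier objective
    H = t * hess_f(x) + np.diag(1.0 / x**2)      # Hessian: positive definite
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]
```

For instance, with $f(x) = \sum_j x_j \ln x_j$ and the single constraint $\sum_j x_j = 1$, the uniform point is already optimal for every t, and the returned step is (numerically) zero.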
The convergence properties of this simple barrier method now depend on properties of f. We refer the interested reader to ([11], Chapter 11) for the details.
Generally speaking, today’s practical Interior Point Methods are “Primal-Dual Interior Point Methods”: They solve the “primal” optimization problem and the dual—the system in Theorem 1—simultaneously. After (successful) termination, they return not only an $ε$-approximation x of the optimum, but also the Lagrange multipliers $λ$ which certify optimality.

## 3. Theoretical View on Bertschinger et al.’s Convex Program

We will use the following running example in this section.
Example 1 (Part 1/4).
Let $X := \{0, 1, 2, 3\}$, $Y, Z := \{0, 1\}$, and
$\begin{aligned} &b^y_{0,0} := 1/4 & &b^y_{0,1} := 0 & &b^z_{0,0} := 1/4 & &b^z_{0,1} := 0 \\ &b^y_{1,0} := 0 & &b^y_{1,1} := 1/4 & &b^z_{1,0} := 0 & &b^z_{1,1} := 1/4 \\ &b^y_{2,0} := 1/8 & &b^y_{2,1} := 1/8 & &b^z_{2,0} := 1/8 & &b^z_{2,1} := 1/8 \\ &b^y_{3,0} := 1/8 & &b^y_{3,1} := 1/8 & &b^z_{3,0} := 1/8 & &b^z_{3,1} := 1/8 \end{aligned}$
Among all $q \in R^{X \times Y \times Z}$ (16 dimensions) satisfying the 16 equations
$q_{x,y,*} = b^y_{x,y} \;\forall x, y, \qquad q_{x,*,z} = b^z_{x,z} \;\forall x, z,$
and the 16 inequalities $q x , y , z ≥ 0$ for all $x , y , z$, we want to find one which minimizes $MI ( x , y , z ) ∼ q ( x : y , z )$.

#### 3.1. The Feasible Region

In this section, we discuss the feasible region
$¶(b) := \bigl\{ q \in R^{X \times Y \times Z} \;\bigm|\; q \ge 0, \; q_{x,y,*} = b^y_{x,y} \text{ and } q_{x,*,z} = b^z_{x,z} \text{ for all } x, y, z \bigr\}.$
We will omit the “$( b )$” when no confusion can arise. We will always make the following assumptions:
• $b^y$ and $b^z$ are probability distributions.
• $¶(b)$ is non-empty. This is equivalent to $b^y_{x,*} = b^z_{x,*}$ for all $x \in X$.
• No element of X is “redundant”, i.e., for every $x \in X$ we have both $b^y_{x,*} > 0$ and $b^z_{x,*} > 0$.
First of all, recall the vectors $\bar d \in R^{X \times Y \times Z}$ from Appendix A.1 of [1], defined by
$\bar d^{\,x,y,z,y',z'} := 1_{x,y,z} + 1_{x,y',z'} - 1_{x,y,z'} - 1_{x,y',z} \qquad (4)$
(where $1_{\cdots}$ is the vector with exactly one non-zero entry, in the given position, and 0 otherwise); these vectors satisfy $\bar d_{x,y,*} = \bar d_{x,*,z} = 0$ for all $x, y, z$ (we omit the superscripts for convenience, when possible).
Our first proposition identifies the triples $( x , y , z )$ for which the equation $q x , y , z = 0$ holds for all $q ∈ ¶ ( b )$.
Proposition 1.
If $q_{x,y,z} = 0$ holds for all $q \in ¶(b)$, then $b^y_{x,y} = 0$ or $b^z_{x,z} = 0$.
Proof.
Let $x , y , z ∈ X × Y × Z$, and assume that $b x , y y ≠ 0$ and $b x , z z ≠ 0$. Take any $p ∈ ¶ ( b )$ with $p x , y , z = 0$—if no such p exists, we are done. Since the marginals are not zero, there exist $y ′ ∈ Y \ { y }$ and $z ′ ∈ Z \ { z }$ with $p x , y ′ , z > 0$ and $p x , y , z ′ > 0$. Let $d ¯ : = d ¯ x , y , z , y ′ , z ′$. By the remarks preceding the proposition, $p + δ d ¯$ satisfies the marginal equations for all $δ ∈ R$. Since $p x , y ′ , z > 0$ and $p x , y , z ′ > 0$, there exists a $δ > 0$ such that $q : = p + δ d ¯ ≥ 0$. Hence, we have $q ∈ ¶ ( b )$ and $q x , y , z > 0$, proving that the equation $q x , y , z = 0$ is not satisfied by all elements of $¶ ( b )$. ☐
The proposition allows us to restrict the set of variables of the Convex Program to those triples $(x,y,z)$ for which a feasible q exists with $q_{x,y,z} \ne 0$ (thereby avoiding unnecessary complications in the computation of the objective function and derivatives; see the next section); the equations with RHS 0 become superfluous. We let
$J(b) := \bigl\{ (x,y,z) \in X \times Y \times Z \;\bigm|\; b^y_{x,y} > 0 \text{ and } b^z_{x,z} > 0 \bigr\}$
denote the set of remaining triplets. We will omit the “$(b)$” when no confusion can arise.
Example 2 (Part 2/4).
Continuing Example 1, we see that the equations with RHS 0 allow us to omit the variables $q_{x,y,z}$ for the following triples $(x,y,z)$:
$(0,1,0), \; (0,0,1), \; (0,1,1), \; (1,0,0), \; (1,0,1), \; (1,1,0).$
Hence, $J := \{(0,0,0),\, (1,1,1)\} \cup \{(2,y,z),\, (3,y,z) \mid y, z \in \{0,1\}\}$ (10 variables), and we are left with the following 12 equations:
$\begin{aligned} &q_{0,0,*} = b^y_{0,0} = 1/4 & &q_{0,*,0} = b^z_{0,0} = 1/4 \\ &q_{1,1,*} = b^y_{1,1} = 1/4 & &q_{1,*,1} = b^z_{1,1} = 1/4 \\ &q_{2,0,*} = b^y_{2,0} = 1/8 & &q_{2,1,*} = b^y_{2,1} = 1/8 & &q_{2,*,0} = b^z_{2,0} = 1/8 & &q_{2,*,1} = b^z_{2,1} = 1/8 \\ &q_{3,0,*} = b^y_{3,0} = 1/8 & &q_{3,1,*} = b^y_{3,1} = 1/8 & &q_{3,*,0} = b^z_{3,0} = 1/8 & &q_{3,*,1} = b^z_{3,1} = 1/8. \end{aligned}$
We can now give an easy expression for the dimension of the feasible region (with respect to the number of variables, $|J(b)|$).
Corollary 1.
The dimension of $¶(b)$ is $\sum_{x \in X} (|Y_x| - 1)(|Z_x| - 1)$, with the sets $Y_x$, $Z_x$ defined in the proof below.
Proof.
For $x ∈ X$, the set $Y x : = { y ∈ Y ∣ b x , y y > 0 }$ is non-empty, by assumption 3; the same is true for $Z x : = { z ∈ Z ∣ b x , z z > 0 }$. Note that $J ( b )$ is the disjoint union of the sets ${ x } × Y x × Z x$, $x ∈ X$.
Lemmas 26 and 27 of [1] now basically complete the proof of our corollary: they imply that the dimension is equal to
$|J(b)| - \sum_{x \in X} \bigl( |Y_x| + |Z_x| - 1 \bigr).$
As $J(b)$ is the disjoint union of the sets $\{x\} \times Y_x \times Z_x$, $x \in X$, we find this quantity to be equal to
$\sum_{x \in X} \bigl( |Y_x| \cdot |Z_x| - |Y_x| - |Z_x| + 1 \bigr).$
Finally,
$|Y_x| \cdot |Z_x| - |Y_x| - |Z_x| + 1 = (|Y_x| - 1)(|Z_x| - 1),$
which concludes the proof of the corollary. ☐
Example 3 (Part 3/4).
Continuing Example 1, we find that the values for the variables $q_{0,0,0}$ and $q_{1,1,1}$ are fixed to 1/4, whereas each of the two sets of four variables $q_{2,y,z}$, $y, z \in \{0,1\}$, and $q_{3,y,z}$, $y, z \in \{0,1\}$, offers one degree of freedom, so the dimension should be 2. And indeed: $\sum_{x} (|Y_x| - 1)(|Z_x| - 1) = 0 + 0 + 1 + 1 = 2$.

#### 3.2. The Objective Function and Optimality

We now discuss the objective function. The goal is to minimize
$MI_{(x,y,z)\sim q}(x : y,z) = H_{(x,y,z)\sim q}(x) - H_{(x,y,z)\sim q}(x \mid y,z).$
Since the distribution of x is fixed by the marginal equations, the first term in the sum is a constant, and we are left with minimizing negative conditional entropy. We start by discussing negative conditional entropy as a function on its full domain (we don’t assume q to be a probability distribution, let alone to satisfy the marginal equations): $f \colon R_+^J \to R$ (with $J = J(b)$ for arbitrary but fixed b),
$f(q) := \sum_{(x,y,z) \in J} q_{x,y,z} \ln\bigl( q_{x,y,z} / q_{*,y,z} \bigr),$
where we set $0 \ln(\cdots) := 0$, as usual. The function f is continuous on its domain, and it is smooth at every q with strictly positive entries. Indeed, we have the gradient
$\frac{\partial f}{\partial q_{x,y,z}}(q) = \ln\bigl( q_{x,y,z} / q_{*,y,z} \bigr)$
and the Hessian
$\frac{\partial^2 f}{\partial q_{x,y,z}\, \partial q_{u,v,w}}(q) = \delta_{y,v}\, \delta_{z,w} \Bigl( \frac{\delta_{x,u}}{q_{x,y,z}} - \frac{1}{q_{*,y,z}} \Bigr).$
It is worth pointing out that the Hessian, while positive semidefinite, is not positive definite, and, more pertinently, it is not in general positive definite on the tangent space of the feasible region, i.e., $r^\top H_f(q)\, r = 0$ is possible for $r \in R^J$ with $r_{x,y,*} = r_{x,*,z} = 0$ for all $(x,y,z)$. Indeed, if, e.g., $J = X \times Y \times Z$, it is easy to see that the kernel of $H_f(q)$ has dimension $|Y| \cdot |Z|$ (it is spanned by the vectors which agree with q on one block $X \times \{(y,z)\}$ and vanish elsewhere), whereas the tangent space of the feasible region has dimension $|X|\,(|Y| - 1)(|Z| - 1)$. Hence, if $|Y| \cdot |Z| + |X|\,(|Y| - 1)(|Z| - 1) > |X| \cdot |Y| \cdot |Z|$, then for every $q$ (!) the kernel of $H_f(q)$ must have a non-trivial intersection with the tangent space of the feasible region.
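This degeneracy is easy to observe numerically. The sketch below (the array layout and helper name are our own) assembles $H_f$ for a uniform q on a $2 \times 3 \times 3$ cube and counts the numerically zero eigenvalues, which come out as $|Y| \cdot |Z| = 9$:

```python
import numpy as np

def hessian_f(q):
    """Hessian of f(q) = sum_{x,y,z} q_{xyz} ln(q_{xyz} / q_{*yz}),
    for q with strictly positive entries, shape (|X|, |Y|, |Z|)."""
    X, Y, Z = q.shape
    idx = lambda x, y, z: (x * Y + y) * Z + z
    H = np.zeros((X * Y * Z, X * Y * Z))
    for y in range(Y):
        for z in range(Z):
            s = q[:, y, z].sum()                  # q_{*,y,z}
            for x in range(X):
                for u in range(X):
                    H[idx(x, y, z), idx(u, y, z)] = (x == u) / q[x, y, z] - 1.0 / s
    return H

q = np.full((2, 3, 3), 1.0 / 18.0)                # uniform q, |X| = 2, |Y| = |Z| = 3
eigs = np.linalg.eigvalsh(hessian_f(q))
kernel_dim = int((np.abs(eigs) < 1e-9).sum())     # = |Y| * |Z| = 9
```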
Optimality condition and boundary issues. In the case of points q which lie on the boundary of the domain, i.e., $q_{x,y,z} = 0$ for at least one triplet $(x,y,z)$, some partial derivatives don’t exist. For $y, z$, denote $X_{yz} := \{ x \mid (x,y,z) \in J \}$. The situation is as follows.
Proposition 2.
Let $q ∈ ¶$.
(a)
If there is an $(x,y,z) \in J$ with $q_{x,y,z} = 0$ but $q_{*,y,z} > 0$, then f does not have a subgradient at q. Indeed, there is a feasible descent direction of f at q with directional derivative $-\infty$.
(b)
Otherwise—i.e., for all $( x , y , z ) ∈ J$, $q x , y , z = 0$ only if $q ∗ , y , z = 0$—subgradients exist. For all $y , z$, let $ϱ y , z ∈ [ 0 , 1 ] X y z$ be a probability distribution on $X y z$. Suppose that, for all $y , z$ with $q ∗ , y , z > 0$,
$\varrho^{y,z}_x = \frac{q_{x,y,z}}{q_{*,y,z}} \quad \text{for all } x \in X_{yz}.$
Then g defined by $g x , y , z : = ln ( ϱ x y , z )$, for all $( x , y , z ) ∈ J$, is a subgradient of f at q.
Moreover, $g'$ is a subgradient if and only if there exists such a g with
• $g'_{x,y,z} \le g_{x,y,z}$ for all $(x,y,z) \in J$ with $q_{x,y,z} = 0$;
• $g'_{x,y,z} = g_{x,y,z}$ for all $(x,y,z) \in J$ with $q_{x,y,z} > 0$.
Proof.
For part (a), let $(y,z) \in Y \times Z$ with $q_{*,y,z} > 0$, and $x \in X_{yz}$ with $q_{x,y,z} = 0$. There exist $y', z'$ such that $q_{x,y',z}, q_{x,y,z'} > 0$. This means that $\bar d := \bar d^{\,x,y,z,y',z'}$ as defined in (4) is a feasible direction. Direct calculations (written down in [12]) show that $f'(q; \bar d) = -\infty$. Invoking Lemma 1 yields non-existence of the subgradient.
As to part (b), we prove the statement for every pair $(y,z) \in Y \times Z$ for the function
$f_{yz} \colon R_+^{X_{yz}} \to R \colon q \mapsto \sum_{x \in X_{yz}} q_x \ln( q_x / q_* ),$
and then use Lemma 2.
Let us fix one pair $(y,z)$. If $q_{x,y,z} > 0$ holds for all $x \in X_{yz}$, then $f_{yz}$ is differentiable at q, so we simply apply Remark 2.
Now assume $q_{*,y,z} = 0$. A vector $g \in R^{X_{yz}}$ is a subgradient of $f_{yz}$ at this q if and only if
$\sum_{x \in X_{yz}} r_x \ln( r_x / r_* ) = f_{yz}(q + r) \overset{!}{\ge} f_{yz}(q) + g^\top r = \sum_{x \in X_{yz}} r_x g_x \qquad (10)$
holds for all $r \in R^{X_{yz}}$ with $r_x \ge 0$ for all x and $r_* > 0$ (note that $f_{yz}(q) = 0$ here).
We immediately deduce $g_x \le 0$ for all x. Let $\varrho'_x := e^{g_x}$ for all x, and $\varrho_x := \varrho'_x / C$, with $C := \varrho'_*$. Clearly $\varrho$ is a probability distribution on $X_{yz}$. Moreover, the difference LHS minus RHS of (10) is equal to
$r_* \bigl( D_{\mathrm{KL}}( r / r_* \,\|\, \varrho ) - \ln C \bigr),$
with $D_{\mathrm{KL}}$ denoting Kullback-Leibler divergence. From the usual properties of the Kullback-Leibler divergence, we see that this expression is greater than or equal to 0 for all r if and only if $C \le 1$, which translates to
$\sum_x e^{g_x} \le 1.$
From this, the statements in Proposition 2 follow. ☐
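The criterion can be sanity-checked numerically: for $g_x = \ln(\varrho_x)$ with ϱ a probability distribution, $\sum_x e^{g_x} = 1$, so g must satisfy the subgradient inequality at $q = 0$. A Python sketch with our own toy numbers, testing the inequality on random points:

```python
from math import log
import random

def f_yz(r):
    # f_{yz}(r) = sum_x r_x ln(r_x / r_*)  (negative conditional entropy block)
    s = sum(r)
    return sum(v * log(v / s) for v in r if v > 0)

rho = [0.5, 0.3, 0.2]               # a probability distribution on X_{yz}
g = [log(p) for p in rho]           # candidate subgradient at q = 0

random.seed(0)
subgradient_ok = all(
    f_yz(r) >= sum(gi * ri for gi, ri in zip(g, r)) - 1e-12
    for r in ([random.random() for _ in range(3)] for _ in range(100))
)
```

The difference of the two sides equals $r_* \, D_{\mathrm{KL}}(r/r_* \| \varrho) \ge 0$, so `subgradient_ok` is `True`.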
From the proposition, we derive the following corollary.
Corollary 2.
Suppose a minimum of f over $¶ ( b )$ is attained in a point q with $q x , y , z = 0$ for a triple $( x , y , z )$ with $b x , y y > 0$ and $b x , z z > 0$. Then $q u , y , z = 0$ for all $u ∈ X$.
Proof.
This follows immediately from the fact, expressed in part (a) of Proposition 2, that a feasible descent direction exists at every point q with $q_{x,y,z} = 0$ for a triple $(x,y,z) \in J(b)$ with $q_{*,y,z} > 0$. ☐
Based on Proposition 2 and the Karush-Kuhn-Tucker conditions, Theorem 1, we can now write down the optimality condition.
Corollary 3.
Let $q ∈ ¶ ( b )$. The minimum of f over $¶ ( b )$ is attained in q if, and only if, (a) $q ∗ , y , z = 0$ holds whenever there is an $( x , y , z ) ∈ J ( b )$ with $q x , y , z = 0$; and (b) there exist
• $λ x , y ∈ R$, for each $( x , y )$ with $b x , y y > 0$;
• $μ x , z ∈ R$, for each $( x , z )$ with $b x , z z > 0$;
satisfying the following: for all $y, z$ with $q_{*,y,z} > 0$,
$\lambda_{x,y} + \mu_{x,z} = \ln \frac{q_{x,y,z}}{q_{*,y,z}} \quad \text{holds for all } x \in X_{yz};$
and for all $y, z$ with $q_{*,y,z} = 0$ (but $X_{yz} \ne \emptyset$), there is a probability distribution $\varrho^{y,z}$ on $X_{yz}$ such that
$\lambda_{x,y} + \mu_{x,z} \le \ln( \varrho^{y,z}_x ) \quad \text{holds for all } x \in X_{yz}.$
Example 4 (Part 4/4).
Continuing Example 1, let us now find the optimal solution “by hand”. First of all, note that $q 0 , 0 , 0 = 1 / 4 = q 1 , 1 , 1$ holds for all feasible solutions. By Corollary 2, if q is an optimal solution, since $q 0 , 0 , 0 > 0$, we must have $q x , 0 , 0 > 0$ for $x =$ 2, 3; and similarly, $q x , 1 , 1 > 0$ for x = 2, 3.
To verify whether or not a given q is optimal, we have to find solutions $\lambda_{x,y}$, $\mu_{x,z}$ to the following system of equations and/or inequalities:
$\begin{aligned} &\lambda_{x,0} + \mu_{x,0} = \ln( q_{x,0,0} / q_{*,0,0} ) & &\text{for } x \in \{0, 2, 3\} \\ &\lambda_{x,1} + \mu_{x,1} = \ln( q_{x,1,1} / q_{*,1,1} ) & &\text{for } x \in \{1, 2, 3\} \\ &\lambda_{x,0} + \mu_{x,1} = \ln( q_{x,0,1} / q_{*,0,1} ), \;\text{ or } \le \ln( \varrho^{01}_x ) \text{ if } q_{*,0,1} = 0, & &\text{for } x \in \{2, 3\} \\ &\lambda_{x,1} + \mu_{x,0} = \ln( q_{x,1,0} / q_{*,1,0} ), \;\text{ or } \le \ln( \varrho^{10}_x ) \text{ if } q_{*,1,0} = 0, & &\text{for } x \in \{2, 3\} \end{aligned} \qquad (11)$
The ϱ’s, if needed, have to be found: both $\varrho^{01}$ and $\varrho^{10}$ must be probability distributions on $\{2, 3\}$.
From the conditions on the zero-nonzero pattern of an optimal solution, we can readily guess an optimal q. Let’s guess wrongly, first, though: $q 2 , y , z = q 3 , y , z = 1 / 16$ for all $y , z$. In this case, all of the constraints in (11) become equations, and the ϱ’s don’t occur. We have a system of equations in the variables $λ · , ·$ and $μ · , ·$. It is easy to check that the system does not have a solution.
Our next guess for an optimal solution q is: $q_{2,0,1} = q_{3,0,1} = 0$; $q_{2,1,0} = q_{3,1,0} = 0$; $q_{2,0,0} = q_{3,0,0} = 1/8$; $q_{2,1,1} = q_{3,1,1} = 1/8$. In (11), the constraints involving $(y,z) = (0,1), (1,0)$ are relaxed to inequalities, with the freedom to pick arbitrary probability distributions $\varrho^{01}$ and $\varrho^{10}$ in the RHSs. We choose the following: $\lambda_{x,y} = \mu_{x,z} = \tfrac{1}{2} \ln(1/2)$ for $x \in \{0, 1\}$, and $\lambda_{x,y} = \mu_{x,z} = \tfrac{1}{2} \ln(1/4)$ for $x \in \{2, 3\}$. The equations are clearly satisfied. The inequalities are satisfied if we take $\varrho^{01}_2 = \varrho^{01}_3 = 1/2$ and $\varrho^{10}_2 = \varrho^{10}_3 = 1/2$.

#### 3.3. Algorithmic Approaches

We now discuss several possibilities of solving the convex program (CP): Gradient descent and interior point methods; geometric programming; cone programming over the so-called exponential cone. We’ll explain the terms in the subsections.

Proposition 2 and Corollary 2 together with the running example make clear that the boundary is “problematic”. On the one hand, the optimal point can sometimes be on the boundary. (In fact, this is already the case for the AND-gate, as computed in Appendix A.2 of [1].) On the other hand, by Corollary 2, optimal boundary points lie in a lower dimensional subset inside the boundary (codimension $\approx |X|$), and the optimal points on the boundary are “squeezed in” between boundary regions which are “infinitely strongly repellent” (which is to say that the feasible descent direction has directional derivative $-\infty$).
From the perspective of choosing an algorithm, it is pertinent that subgradients do not exist everywhere on the boundary. This rules out the use of algorithms which rely on evaluating (sub-)gradients on the boundary, such as projected (sub-)gradient descent, and also generic active set and sequential quadratic programming methods (we refer the interested reader to [13] for background on these optimization methods); the computational results in Section 4 illustrate that.

#### 3.3.1. Gradient Descent

Due to the huge popularity of gradient descent, the authors felt under social pressure to present at least one version of it. Thus, we designed an ad-hoc quick-and-dirty gradient descent algorithm which does its best to avoid the pitfall of the feasible region: its boundary. We now describe this algorithm.
Denote by A the matrix representing the LHS of the equations in (CP); also reduce A by removing rows which are linear combinations of other rows, so that A has full row rank. Now, multiplication by the matrix $P := I - A^\top (A A^\top)^{-1} A$ amounts to projection onto the tangent space $\{ d \mid A d = 0 \}$ of the feasible region, and $P \nabla f(q)$ is the gradient of f in the tangent space, taken at the point q.
The strategy by which we try to avoid the dangers of approaching the boundary of the feasible region in the “wrong” way is to never reduce the smallest entry of the current iterate q by more than $10\%$. Here is the algorithm.
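One iteration of such a scheme can be sketched as follows (Python with numpy; the helper names, the base step size, and the exact form of the 10% cap are our own choices, not the paper’s listing):

```python
import numpy as np

def project_to_tangent(A, v):
    """Orthogonal projection of v onto {d | A d = 0} (A of full row rank)."""
    return v - A.T @ np.linalg.solve(A @ A.T, A @ v)

def descent_step(q, grad, A, eta0=0.1):
    """One projected-gradient step, with the step size capped so that no
    entry of q shrinks by more than 10% (our reading of the safeguard)."""
    d = -project_to_tangent(A, grad)
    neg = d < 0
    eta = eta0
    if neg.any():
        eta = min(eta0, float((0.1 * q[neg] / -d[neg]).min()))
    return q + eta * d

# Toy instance: minimize sum(q log q) subject to sum(q) = 1, q > 0.
A = np.ones((1, 3))
q0 = np.array([0.5, 0.3, 0.2])
q1 = descent_step(q0, np.log(q0) + 1.0, A)
```

The step keeps the iterate feasible (the equality constraints are preserved exactly, and positivity is preserved by the cap) while decreasing the objective.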
There are lots of challenges with this approach, not the least of which is deciding on the step size η. Generally, a good step size for gradient descent is 1 over the largest eigenvalue of the Hessian—but the eigenvalues of the Hessian tend to infinity.
The stopping criterion is also not obvious: we use a combination of the norm of the projected gradient, the distance to the boundary, and a maximum of 1000 iterations.
None of these decisions are motivated by careful thought.

#### 3.3.2. Interior Point Methods

Using Interior Point Methods (IPMs) appears to be the natural approach: While the iterates can converge to a point on the boundary, none of the iterates actually lie on the boundary, and that is an inherent property of the method (not a condition which you try to enforce artificially as in the gradient descent approach of the previous section). Consequently, problems with gradients, or even non-existing subgradients, never occur.
Even here, however, there are caveats involving the boundary. Let us consider, as an example, the simple barrier approach sketched in Section 2.1. The analysis of this method (see [11]) requires that the function $F \colon q \mapsto t f(q) - \sum_{x,y,z} \ln( q_{x,y,z} )$ be self-concordant, which means that, for some constant C, for all $q, h$:
$\bigl| D^3 F(q)[h, h, h] \bigr| \le C \bigl( h^\top H_F(q)\, h \bigr)^{3/2}, \qquad (12)$
where $D^3$ denotes the tensor of third derivatives. The following is proven in [12]:
Proposition 3.
Let $n ≥ 2$, and consider the function
$F \colon q \mapsto t \sum_{x=1}^{n} q_x \ln( q_x / q_* ) - \sum_{x} \ln( q_x ).$
There is no C and no t such that (12) holds for all q with positive entries and all h.
The proposition explains why, even for some IPMs, approaching the boundary can be problematic. We refer to [12] for more discussions about self-concordancy issues of the (CP).
Corollary 4.
The complexity of the interior point method via self-concordant barrier for solving (CP) is $O( M \cdot \log \tfrac{1}{\varepsilon} )$, where M is the complexity of computing the Newton step (which can be done by computing and inverting the Hessian of the barrier).
Proof.
This follows immediately from the fact that a 3-self concordant barrier exists for (CP) (see [12]) and the complexity analysis of barrier method in [11], Chapter 11. ☐
Still, we find the IPM approach (with the commercial Interior Point software “Mosek”) to be the most usable of all the approaches which we have tried.

#### 3.3.3. Geometric Programming

Geometric Programs form a sub-class of Convex Programs; they are considered to be more easily solvable than general Convex Programs: specialized algorithms for solving Geometric Programs have been around for half a century (or more). We refer to [14] for the definition and background on Geometric Programming. The Lagrange dual of (CP) can be written as the following Geometric Program in the variables $\lambda \in R^{X \times Y}$, $\mu \in R^{X \times Z}$ (for simplicity, assume that all the right-hand sides $b_\cdot$ are strictly positive):
$\text{maximize} \; \sum_{x,y} b^y_{x,y} \lambda_{x,y} + \sum_{x,z} b^z_{x,z} \mu_{x,z} \quad \text{subject to} \; \sum_{x} e^{\lambda_{x,y} + \mu_{x,z}} \le 1 \; \text{for all } (y,z) \in Y \times Z.$

#### 3.3.4. Exponential Cone Programming

Cone Programming is a far-reaching generalization of Linear Programming, which may contain so-called generalized inequalities: For a fixed closed convex cone $K$, the generalized inequality “$a ≤ K b$” simply translates into $b − a ∈ K$. There is a duality theory similar to that for Linear Programming.
Efficient algorithms for Cone Programming exist for some closed convex cones; for example for the exponential cone [15], $K exp$, which is the closure of all triples $( r , p , q ) ∈ R 3$ satisfying
$q > 0 and q e r / q ≤ p .$
For $q > 0$, the condition on the right-hand side is equivalent to $r \le q \ln( p / q )$, from which it can easily be verified that the following “Exponential Cone Program” is equivalent to (CP). The variables are $r, q, p \in R^{X \times Y \times Z}$.
$\begin{aligned} \text{maximize} \quad & \sum_{x,y,z} r_{x,y,z} \\ \text{subject to} \quad & q_{x,y,*} = b^y_{x,y} & &\text{for all } (x,y) \in X \times Y \\ & q_{x,*,z} = b^z_{x,z} & &\text{for all } (x,z) \in X \times Z \\ & q_{*,y,z} - p_{x,y,z} = 0 & &\text{for all } (x,y,z) \in X \times Y \times Z \\ & ( r_{x,y,z},\, p_{x,y,z},\, q_{x,y,z} ) \in K_{\exp} & &\text{for all } (x,y,z) \in X \times Y \times Z. \end{aligned}$
The first two constraints are just the marginal equations; the third type of equations connects the p-variables with the q-variables, and the generalized inequality connects these to the variables forming the objective function.
There are in fact several ways of modeling (CP) as an Exponential Cone Program; here we present the one which is most pleasant both theoretically (the duality theory is applicable) and in practice (it produces the best computational results). For the details, as well as for another model, we refer to [12].
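The substitution behind this model can be checked on any feasible q with positive entries: setting $p_{x,y,z} := q_{*,y,z}$ and $r_{x,y,z} := q_{x,y,z} \ln( p_{x,y,z} / q_{x,y,z} )$ makes every triple feasible for $K_{\exp}$ and gives $\sum r = -f(q)$. A Python sketch (the dict layout and function name are ours), using the optimal q of the running example:

```python
from math import exp, log

def cone_model_value(q):
    """For feasible q (dict (x,y,z) -> positive prob.), set p_{xyz} := q_{*yz}
    and r_{xyz} := q_{xyz} * ln(p_{xyz} / q_{xyz}); return sum of r = -f(q)."""
    qyz = {}
    for (x, y, z), v in q.items():
        qyz[(y, z)] = qyz.get((y, z), 0.0) + v
    r = {k: v * log(qyz[k[1], k[2]] / v) for k, v in q.items()}
    # each triple (r, p, q) lies in K_exp:  q * exp(r/q) <= p
    assert all(v * exp(r[k] / v) <= qyz[k[1], k[2]] + 1e-12 for k, v in q.items())
    return sum(r.values())

# Optimal q of the running example; sum of r = H(x | y,z) = 1.5 * ln(2) nats.
q_opt = {(0, 0, 0): 0.25, (1, 1, 1): 0.25,
         (2, 0, 0): 0.125, (3, 0, 0): 0.125,
         (2, 1, 1): 0.125, (3, 1, 1): 0.125}
```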

## 4. Computational Results

In this section, we present the computational results obtained by solving Bertschinger et al.’s Convex Program (CP) on a large number of instances, using readily available software out-of-the-box. First we discuss the problem instances, then the convex optimization solvers, then discuss the results. Detailed tables are in the Appendix A.

#### 4.1. Data

In all our instances, the vectors $b^y$ and $b^z$ are marginals computed from an input probability distribution p on $X \times Y \times Z$. Occasionally, “noise” (explained below) in the creation of p leads to the phenomenon that the sum of all probabilities is not 1. This was not corrected before computing $b^y$ and $b^z$, as it is irrelevant for the Convex Program (CP).
We have three types of instances, based on (1) “gates”—the “paradigmatic examples” listed in Table 1 of [1]; (2) Example 31 and Figure A.1 of [1]; (3) discretized 3-dimensional Gaussians.
(1)
Gates. The instances of the type (1) are based on the “gates” (Rdn, Unq, Xor, And, RdnXor, RdnUnqXor, XorAnd) described in Table 1 of [1]:
• Rdn: $X = Y = Z$, uniformly distributed.
• Unq: $X = (Y, Z)$; $Y, Z$ independent, uniformly distributed in $\{0, 1\}$.
• Xor: $X = Y \text{ XOR } Z$; $Y, Z$ independent, uniformly distributed in $\{0, 1\}$.
• And: $X = Y \text{ AND } Z$; $Y, Z$ independent, uniformly distributed in $\{0, 1\}$.
• RdnXor: $X = ( Y_1 \text{ XOR } Z_1,\, W )$, $Y = (Y_1, W)$, $Z = (Z_1, W)$; $Y_1, Z_1, W$ independent, uniformly distributed in $\{0, 1\}$.
• RdnUnqXor: $X = ( Y_1 \text{ XOR } Z_1,\, (Y_2, Z_2),\, W )$, $Y = (Y_1, Y_2, W)$, $Z = (Z_1, Z_2, W)$; $Y_1, Y_2, Z_1, Z_2, W$ independent, uniformly distributed in $\{0, 1\}$.
• XorAnd: $X = ( Y \text{ XOR } Z,\, Y \text{ AND } Z )$; $Y, Z$ independent, uniformly distributed in $\{0, 1\}$.
Each gate gives rise to two sets of instances: (1a) the “unadulterated” probability distribution computed from the definition of the gate (see Table A1); (1b) empirical distributions generated by randomly sampling from $W$ and computing $( x , y , z )$. In creating the latter instances, we incorporate “noise” by perturbing the output randomly with a certain probability. We used 5 levels of noise, corresponding to increased probabilities of perturbing the output. For example, And 0 refers to the empirical distribution of the And gate without perturbation (probability $= 0$), And 4 refers to the empirical distribution of the And gate with output perturbed with probability $0.1$. The perturbation probabilities are: 0, $0.001$, $0.01$, $0.05$, $0.1$. (See Table A2 and Table A3).
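The sampling procedure for the type (1b) instances can be sketched as follows for the And gate. The exact perturbation mechanism is our assumption (we flip the output bit with the given probability); the paper only states that the output is perturbed randomly with a certain probability:

```python
import random

def sample_and_gate(n, noise, seed=0):
    """Empirical distribution of the And gate: draw n i.i.d. pairs
    (Y, Z) uniform in {0,1}^2, set X = Y and Z, and flip X with
    probability `noise`.  Returns a dict (x, y, z) -> relative frequency."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        y, z = rng.randint(0, 1), rng.randint(0, 1)
        x = y & z
        if rng.random() < noise:
            x ^= 1          # perturb the output bit
        counts[(x, y, z)] = counts.get((x, y, z), 0) + 1
    return {k: v / n for k, v in counts.items()}
```

With `noise=0` this yields the empirical distribution called And 0 above; `noise=0.1` corresponds to And 4.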
(2) Example-31 instances. In Appendix A.2 of their paper, Bertschinger et al. discuss the following input probability distribution: $X , Y , Z$ are independent and uniformly random in ${ 0 , 1 }$. They present this as an example showing that the optimization problem can be “ill-conditioned”. We have derived a large number of instances based on that idea, by taking $( X , Y , Z )$ uniformly distributed on $X × Y × Z$, with $X$ ranging in ${ 2 , ⋯ , 5 }$ and $Y = Z$ in ${ X , ⋯ , 5 · X }$. These instances are referred to as “independent” in Table A4. We also created “noisy” versions of the probability distributions by perturbing the probabilities. In the results, we split the instances into two groups according to whether the ratio $Y / X$ is at most 2 or greater than 2. (The rationale behind the choice of the sizes of $X$, $Y$, $Z$ is the fact mentioned in Section 3.2 that the Hessian has a kernel in the tangent space if the ratio is high.)
(3) Discretized Gaussians. We wanted a type of instance radically different from the somewhat “combinatorial” instances (1) and (2). We randomly generated twenty $3 × 3$ covariance matrices between standard Gaussian random variables $X$, $Y$, and $Z$. We then discretized the resulting continuous 3-dimensional Gaussian probability distribution by numerical integration (we used the software Cuba [16], version 4.2, for that) over the boxes $[ 0 , 1 ] 3$, $[ 0 , 0.75 ] 3$, $[ 0 , 0.5 ] 3$, $[ 0 , 0.25 ] 3$, and all of their translates which held probability mass at least $10 − 20$.
We grouped these instances according to the number of variables; see Table A5: the instances in the “Gauss-1” group are those with at most 1000 variables; “Gauss-2” have between 1000 and 5000; “Gauss-3” between 5000 and 25,000; and “Gauss-4” between 25,000 and 75,000.
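As a rough illustration of the discretization step, the probability mass a box holds can be estimated by Monte Carlo sampling from the Gaussian via a Cholesky factor. This is only a sketch; the actual instances were produced with the deterministic numerical integration of Cuba, which is far more accurate:

```python
import random

def cholesky3(c):
    """Lower-triangular Cholesky factor of a 3x3 SPD matrix (list of rows)."""
    l = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = ((c[i][i] - s) ** 0.5 if i == j
                       else (c[i][j] - s) / l[j][j])
    return l

def box_probability(cov, lo, hi, n=200_000, seed=0):
    """Monte Carlo estimate of P[(X,Y,Z) in [lo, hi]^3] for a centered
    Gaussian with covariance `cov`: sample standard normals, correlate
    them through the Cholesky factor, count the hits."""
    rng = random.Random(seed)
    l = cholesky3(cov)
    hits = 0
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        v = [sum(l[i][k] * g[k] for k in range(3)) for i in range(3)]
        if all(lo <= vi <= hi for vi in v):
            hits += 1
    return hits / n
```

Translating the box over a grid and keeping every cell whose estimated mass exceeds the threshold then yields a discrete distribution of the kind used for the Gauss instances.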

#### 4.2. Convex Optimization Software We Used

For our computational results, we made an effort to use as many software toolboxes as we could get our hands on. (The differences in the results are indeed striking.) Most of our software was coded in the Julia programming language (version 0.6.0) using the “MathProgBase” package (see [17]), which is part of JuliaOpt. (Only for the CVXOPT Interior Point algorithm and for the Geometric Program did we use Python.) On top of that, the following software was used.
• CVXOPT [18] is written and maintained by Andersen, Dahl, and Vandenberghe. It transforms a general Convex Program with nonlinear objective function into epigraph form (see Section 2.1) before deploying an Interior Point Method. We used version 1.1.9.
• Artelys Knitro [19] is an optimization suite offering four algorithms for general Convex Programming, and we tested all of them on (CP): “Knitro_Ip” refers to their standard Interior Point Method [20]; “Knitro_IpCG” to their IPM with conjugate gradient (it uses projected conjugate gradient to solve the linear system [21]); “Knitro_As” is their Active Set Method [22]; “Knitro_SQP” designates their Sequential Quadratic Programming Method [19]. We used version 10.2.
• Mosek [23] is an optimization suite which offers algorithms for a vast range of convex optimization problems. Their Interior Point Method is described in [24,25]. We used version 8.0.
• Ipopt [20] is a software for nonlinear optimization which can also deal with non-convex objectives. At its heart is an Interior Point Method (as the name indicates), which is enhanced by approaches to ensure convergence even in the non-convex case. We used version 3.0.
• ECOS. We are aware of only two Conic Optimization software toolboxes that can solve Exponential Cone Programs: ECOS and SCS.
ECOS is a lightweight numerical software package for solving convex cone programs [26], using an Interior Point approach. We used the version from 8 November 2016.
• SCS [27,28] stands for Splitting Conic Solver. It is a numerical optimization package for solving large-scale convex cone problems. It is a first-order method, and generally very fast. We used version 1.2.7.

#### 4.3. Results

We now discuss the computational results of every solver. The tables in Appendix A give three data points for each pair of instance (group) and solver: whether the instance was solved or, respectively, which fraction of the instances in the group was solved (“solved”); the time (“time”) or, respectively, the average time (“avg. tm”) needed to solve it/them (all solvers had a time limit of 10,000 s; we used a computer server with an Intel(R) Core(TM) i7-4790K CPU (4 cores) and 16 GB of RAM). The third we call the “optimality measure”: when a solver reports having found (optimal) primal and dual solutions, we computed:
• the maximum amount by which any of the marginal equations is violated;
• the maximum amount by which any of the nonnegativity inequalities is violated;
• the maximum amount by which the inequality “$A ⏉ λ ≤ g$” in Theorem 1 is violated;
• the maximum amount by which the complementarity is violated.
Of all these maxima, we took the largest, as an indication of how reliable the returned solution is with respect to feasibility and optimality. In the tables, the maximum is then taken over all instances in the group represented in each row. We chose this “worst-case” approach to emphasize the need for reliable software.
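In code, the optimality measure of a reported primal-dual pair can be computed along the following lines (a sketch with our own naming: A is the matrix of the marginal equations, b its right-hand side, q the primal point, lam the dual point, and g the gradient of the objective at q; the exact form of the complementarity check is our reading of Theorem 1):

```python
def optimality_measure(A, b, q, lam, g):
    """Worst violation, over the four checks in the text, for a
    reported primal point q and dual point lam.  A is a dense m x n
    matrix given as a list of rows."""
    m, n = len(A), len(q)
    # 1. Marginal equations:  A q = b.
    eq = max(abs(sum(A[i][j] * q[j] for j in range(n)) - b[i])
             for i in range(m))
    # 2. Nonnegativity:  q >= 0.
    nonneg = max(max(-q[j], 0.0) for j in range(n))
    # 3. Dual feasibility:  A^T lam <= g.
    at_lam = [sum(A[i][j] * lam[i] for i in range(m)) for j in range(n)]
    dual = max(max(at_lam[j] - g[j], 0.0) for j in range(n))
    # 4. Complementarity:  q_j * (g_j - (A^T lam)_j) = 0.
    comp = max(abs(q[j] * (g[j] - at_lam[j])) for j in range(n))
    return max(eq, nonneg, dual, comp)
```

A returned value of, say, $10^{-6}$ then means that no condition is violated by more than $10^{-6}$, which is the unit used in the tables.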

#### 4.3.1. Gradient Descent

The Gradient Descent algorithm sketched in the previous section finds its limits when the matrix P becomes too large to fit into memory, which happens for the larger Gaussian instances.
Even before that, the weaknesses of the algorithm show in the fact that the best solution it finds is often not optimal (the other methods produce better solutions). The running times are mid-field compared to the other methods.
A more carefully crafted gradient descent would probably perform better, but the conceptual weaknesses of the approach would remain.

#### 4.3.2. Interior Point Algorithms

Cvxopt. It terminated with an optimal solution on all the noiseless discrete gates except “RDN” (see Table A1, Table A2 and Table A3), but whenever noise was added it started failing, except on “XOR”; it kept failing as the number of samples increased, even when the noise was decreased. The optimality measure was bad except on some of the noiseless gates; mainly, the dual solution was only weakly feasible on the solved instances. It bailed out in 56% of the failed runs due to problems with the KKT system; the rest hit the iteration or time limit.
For the Example-31 instances, Cvxopt again failed on the noisy distributions, with 34% success on Noisy 1 and 0% on Noisy 2. It solved all the independent instances correctly, with acceptable optimality measure.
For the Gaussian instances, it failed completely: 40% ran into computational problems in solving the KKT system, 38% hit the time limit, and the rest the iteration limit. The solver was the slowest of all and required much memory to store the model.
Knitro_Ip. It terminated with an optimal solution on all the discrete gates, with good optimality measure (see Table A1, Table A2, Table A3 and Table A4). For the Gaussian instances, however, it failed most of the time: it solved only 5% of them, and 75% of the solved ones had fewer than 1000 variables (Table A5). For 25% of the unsolved instances, no dual feasible solution was obtained; the rest stopped due to the iteration or time limit, but still had either an infeasible dual solution or violated KKT conditions.
Knitro_Ip was a little slower than Mosek on most of the discrete gates, and very slow on the Gaussian instances. For most instances, Mosek’s optimality measure was 1000 times better than Knitro_Ip’s. Even though it worked well for the discrete gates, we cannot rely on it for this problem, since it does not behave stably as the number of variables grows.
Knitro_IpCG. It terminated with an optimal solution on all the discrete gates except “XOR” and “RDNXOR” (see Table A2 and Table A3); for those two gates it reached the iteration limit. On the Gaussian instances it did a little better than Knitro_Ip, since the projected conjugate gradient helps with large-scale problems. Still, it solved only 6.2% of them, none with fewer than 1000 variables, and 50% with more than 8000 variables (see Table A5). For 25% of the unsolved instances, no dual feasible solution was obtained; the rest stopped due to the iteration or time limit, with optimality-measure violations similar to Knitro_Ip.
Knitro_IpCG was considerably slower than Mosek and Knitro_Ip, and it had a worse optimality measure than Mosek on most of its solved instances. Since it could not solve a large set of instances, this solver is unreliable for this problem.
Mosek terminated reporting an optimal solution on almost all instances. For the gates, it did so with sufficient optimality measure (see Table A1, Table A2, Table A3 and Table A4). For the Gaussian instances, it terminated optimally on 70% (see Table A5); 25% bailed out, mainly because the gap between the primal and dual was negligible while the primal and dual solutions were only weakly feasible. Most of the latter instances had more than 10,000 variables. For the last 5%, Mosek reached its iteration limit.
For all discrete gates, Mosek terminated within milliseconds, except “RDNUNQXOR”, where it took at most 1 s. Of the Gaussians, only those with more than 25,000 variables took up to 5 s to be solved; the rest were done in less than a second.
Ipopt failed on the unadulterated “AND”, “UNQ”, and “XORAND” distributions, since the optimal solution lies on the boundary and Ipopt hits the boundary quite often. It also could not solve any of the noiseless sampled “AND” and “UNQ” gates, but managed to solve 30% of those of “XORAND”. When noise was applied, it worked on 90% of the instances, with some failures on noisy “RDNXOR” gates, and it bailed out on the “RDNUNQXOR” gate at noise levels 10%, 5%, and 1%. Note that the optimality measure was highly violated even when Ipopt terminated optimally (see Table A1, Table A2, Table A3 and Table A4); the violation was mainly in dual feasibility, and occurred for the instances whose solution lies on the boundary.
On the Gaussian instances, Ipopt could not solve a single one (see Table A5): in 55% of them it terminated with computational errors or with the search direction becoming too small, 25% hit the iteration limit, and the rest the time limit. As in the discrete case, the optimality measure was not good. For the discrete gates, Ipopt was as fast as Mosek. Overall, Ipopt is an unreliable solver for this problem.

#### 4.3.3. Geometric Programming: CVXOPT

CVXOPT has a dedicated Geometric Programming interface, which we used to solve the Geometric Program (GP). The approach cannot solve a single instance to optimality: CVXOPT always exceeds the limit on the number of iterations. Hence, we have not listed it in the tables.

#### 4.3.4. Exponential Cone Programming: ECOS & SCS

SCS failed on each and every instance. ECOS, in contrast, had fast running times on average, and its results were reasonably good. For instances of types 1 and 2, ECOS was most of the time the fastest solver, and it terminated optimally on all of them.
ECOS terminated with suboptimal solutions on 90% of the Gaussian instances. On the rest, it ran into numerical problems (the step size became negligible) and terminated with infeasible solutions.
In summary, ECOS handled the type 1 and 2 instances well, but it was mostly slow on the Gaussian instances; some of them took ECOS more than seven hours to terminate sub-optimally. Modeling the problem as an Exponential Cone Program thus seems to be a good choice, and ECOS’s vigorous performance makes it a reliable candidate for this optimization problem.

#### 4.3.5. Miscellaneous Approaches

We include the following in the computational results for completeness; given the discussion in the previous section, it is no surprise that these methods perform badly.
Knitro_SQP failed on the unadulterated “AND” and “RDNUNQXOR” distributions (see Table A1). When noise was added to the discrete gates, it could not solve more than 35% of the instances per gate (see Table A2 and Table A3). It did not have a better optimality measure than Mosek except on “RDN 1”, which counts for little, since Knitro_SQP solved only 8% of those instances. Note that, on the unsolved instances, Knitro_SQP was the only solver that gave wrong answers (on 30% of the unsolved), i.e., it claimed that the problem is infeasible.
Similarly, it failed on 78% of the Gaussian instances due to computational problems, with the rest hitting the iteration or time limit (see Table A5). This confirms that this type of algorithm is unsuitable for our problem; see Section 3.
Knitro_As. It failed on all the instances, attempting to evaluate the Hessian at boundary points.

## 5. Conclusions

A closer look at the Convex Program proposed by Bertschinger et al. [1] reveals some subtleties which make the computation of the information decomposition challenging. We have tried to solve the convex program using a number of software packages out-of-the-box (including, by the way, [29]).
Two of the solvers, namely Mosek and ECOS, work very satisfactorily. Even though ECOS was sometimes 1000 times faster than Mosek on some type 1 and 2 instances, the time difference there was never more than 5 s, making this advantage rather dispensable. Moreover, on these instances, Mosek had the better optimality measures. Note that on the hard problems of type 1, namely “RDNUNQXOR”, Mosek was faster than ECOS (see Figure 1).
On the other hand, ECOS was slower than Mosek on the Gaussian instances, especially when (CP) had a huge number of variables. For these instances, the time difference is significant (hours; see Figure 2), which is rather problematic for ECOS.
Hence, each of the two solvers has its pros and cons. This means that using both solvers is a good strategy when approaching the problem; one suggestion would be to use ECOS as a surrogate when Mosek fails to produce a solution. Earlier ad-hoc approaches, which were based on CVXOPT and on attempts to “repair” situations in which the solver failed to converge, appear to be obsolete.
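The suggested strategy amounts to a small wrapper; in this sketch, the two solver interfaces are hypothetical callables (instance in, solution or None out), not the actual Mosek or ECOS APIs:

```python
def solve_with_fallback(instance, primary, fallback):
    """Try the primary solver (e.g. Mosek); if it raises an exception
    or reports failure by returning None, fall back to the secondary
    solver (e.g. ECOS).  Returns (solution, which-solver-was-used)."""
    try:
        sol = primary(instance)
        if sol is not None:
            return sol, "primary"
    except Exception:
        pass                      # primary crashed; try the fallback
    return fallback(instance), "fallback"
```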

## Supplementary Materials

The source code in Julia used for the computations can be found here: github.com/dot-at/BROJA-Bivariate-Partial_Information_Decomposition.

## Acknowledgments

This research was supported by the Estonian Research Council, ETAG (Eesti Teadusagentuur), through PUT Exploratory Grant #620. We also gratefully acknowledge funding by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science, EXCS.

## Author Contributions

RV+DOT developed the first algorithms and wrote the code. DOT+AM contributed the propositions and proofs, AM performed the computations and created the tables.

## Conflicts of Interest

The authors declare no conflict of interest.

## Appendix A. Tables of Instances and Computational Results

Table A1. Results for type (1a) instances: Gates—unadulterated probability distributions. (Columns, for each solver: % solved, time ($10^{-2}$ s), optimality measure ($10^{-6}$).)
Table A2. Results for type (1b) instances: Gates (XOR, AND, UNQ, RDN) with noise. (Columns, for each solver: % solved, avg. time ($10^{-2}$ s), optimality measure ($10^{-6}$).)
Table A3. Results for type (1b) instances: Gates (XORAND, RDNXOR, RDNUNQXOR) with noise. (Columns, for each solver: % solved, avg. time ($10^{-2}$ s), optimality measure ($10^{-6}$).)
Table A4. Results for type (2) instances: Example-31. (Solvers: CVXOPT, Mosek, Ipopt, ECOS, GD; columns per solver: % solved, time ($10^{-2}$ s), optimality measure ($10^{-6}$).)
Table A5. Results for type (3) instances: Discretized 3-dimensional Gaussians. (Columns, for each solver: % solved, time ($10^{-2}$ s), optimality measure ($10^{-6}$).)

## References

1. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef]
2. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv, 2010; arXiv:1004.2515. [Google Scholar]
3. Griffith, V.; Koch, C. Quantifying synergistic mutual information. In Guided Self-Organization: Inception; Springer: New York, NY, USA, 2014; pp. 159–190. [Google Scholar]
4. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130. [Google Scholar] [CrossRef] [PubMed]
5. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J. Shared information—New insights and problems in decomposing information in complex systems. In Proceedings of the European Conference on Complex Systems, Brussels, Belgium, 2–7 September 2012; Springer: Cham, Switzerland, 2013; pp. 251–269. [Google Scholar]
6. Nemirovski, A. Efficient Methods in Convex Programming. Available online: http://www2.isye.gatech.edu/~nemirovs/Lect_EMCO.pdf (accessed on 6 October 2017).
7. Wibral, M.; Priesemann, V.; Kay, J.W.; Lizier, J.T.; Phillips, W.A. Partial information decomposition as a unified approach to the specification of neural goal functions. Brain Cognit. 2017, 112, 25–38. [Google Scholar] [CrossRef] [PubMed]
8. Ruszczyński, A.P. Nonlinear Optimization; Princeton University Press: Princeton, NJ, USA, 2006; Volume 13. [Google Scholar]
9. Bubeck, S.; Eldan, R. The entropic barrier: A simple and optimal universal self-concordant barrier. arXiv, 2014; arXiv:1412.1587. [Google Scholar]
10. Bubeck, S.; Eldan, R. The entropic barrier: A simple and optimal universal self-concordant barrier. In Proceedings of the 28th Conference on Learning Theory, Paris, France, 3–6 July 2015. [Google Scholar]
11. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
12. Makkeh, A. Applications of Optimization in Some Complex Systems. Ph.D. Thesis, University of Tartu, Tartu, Estonia, 2017. [Google Scholar]
13. Nocedal, J.; Wright, S. Numerical Optimization; Springer: New York, NY, USA, 2006. [Google Scholar]
14. Boyd, S.; Kim, S.J.; Vandenberghe, L.; Hassibi, A. A tutorial on geometric programming. Optim. Eng. 2007, 8, 67–127. [Google Scholar] [CrossRef]
15. Chares, R. Cones and interior-point algorithms for structured convex optimization involving powers and exponentials. Doctoral dissertation, UCL-Université Catholique de Louvain, Louvain-la-Neuve, Belgium, 2009. [Google Scholar]
16. Hahn, T. CUBA—A library for multidimensional numerical integration. Comput. Phys. Commun. 2005, 168, 78–95, [hep-ph/0404043]. [Google Scholar] [CrossRef]
17. Lubin, M.; Dunning, I. Computing in operations research using Julia. INFORMS J. Comput. 2015, 27, 238–248. [Google Scholar] [CrossRef]
18. Sra, S.; Nowozin, S.; Wright, S.J. Optimization for Machine Learning; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
19. Byrd, R.; Nocedal, J.; Waltz, R. KNITRO: An Integrated Package for Nonlinear Optimization. In Large-Scale Nonlinear Optimization; Springer: New York, NY, USA, 2006; pp. 35–59. [Google Scholar]
20. Waltz, R.A.; Morales, J.L.; Nocedal, J.; Orban, D. An interior algorithm for nonlinear optimization that combines line search and trust region steps. Math. Program. 2006, 107, 391–408. [Google Scholar] [CrossRef]
21. Byrd, R.H.; Hribar, M.E.; Nocedal, J. An interior point algorithm for large-scale nonlinear programming. SIAM J. Optim. 1999, 9, 877–900. [Google Scholar] [CrossRef]
22. Byrd, R.H.; Gould, N.I.; Nocedal, J.; Waltz, R.A. An algorithm for nonlinear optimization using linear programming and equality constrained subproblems. Math. Program. 2004, 100, 27–48. [Google Scholar] [CrossRef]
23. ApS, M. Introducing the MOSEK Optimization Suite 8.0.0.94. Available online: http://docs.mosek.com/8.0/pythonfusion/intro_info.html (accessed on 6 October 2017).
24. Andersen, E.D.; Ye, Y. A computational study of the homogeneous algorithm for large-scale convex optimization. Comput. Optim. Appl. 1998, 10, 243–269. [Google Scholar] [CrossRef]
25. Andersen, E.D.; Ye, Y. On a homogeneous algorithm for the monotone complementarity problem. Math. Program. 1999, 84, 375–399. [Google Scholar] [CrossRef]
26. Domahidi, A.; Chu, E.; Boyd, S. ECOS: An SOCP solver for embedded systems. In Proceedings of the European Control Conference (ECC), Zurich, Switzerland, 17–19 July 2013; pp. 3071–3076. [Google Scholar]
27. O’Donoghue, B.; Chu, E.; Parikh, N.; Boyd, S. Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding. J. Optim. Theory Appl. 2016, 169, 1042–1068. [Google Scholar]
28. O’Donoghue, B.; Chu, E.; Parikh, N.; Boyd, S. SCS: Splitting Conic Solver, Version 1.2.7. Available online: https://github.com/cvxgrp/scs (accessed on 6 October 2017).
29. Grant, M.; Boyd, S. CVX: Matlab Software for Disciplined Convex Programming, Version 2.1. Available online: http://cvxr.com/cvx (accessed on 6 October 2017).
Figure 1. Comparison between the running times of Mosek and ECOS for problems with fewer than 10,500 variables.
Figure 2. Comparison between the running times of Mosek and ECOS for problems with at least 16,000 variables.

## Share and Cite

MDPI and ACS Style

Makkeh, A.; Theis, D.O.; Vicente, R. Bivariate Partial Information Decomposition: The Optimization Perspective. Entropy 2017, 19, 530. https://doi.org/10.3390/e19100530

