Computational Techniques for Investigating Information Theoretic Limits of Information Systems

Abstract: Computer-aided methods, based on the entropic linear program framework, have been shown to be effective in assisting the study of information theoretic fundamental limits of information systems. One key element that significantly impacts their computation efficiency and applicability is the reduction of variables, based on problem-specific symmetry and dependence relations. In this work, we propose using the disjoint-set data structure to algorithmically identify the reduction mapping, instead of relying on exhaustive enumeration in the equivalence classification. Based on this reduced linear program, we consider four techniques to investigate the fundamental limits of information systems: (1) computing an outer bound for a given linear combination of information measures and providing the values of information measures at the optimal solution; (2) efficiently computing a polytope tradeoff outer bound between two information quantities; (3) producing a proof (as a weighted sum of known information inequalities) for a computed outer bound; and (4) providing the range for information quantities between which the optimal value does not change, i.e., sensitivity analysis. A toolbox, with an efficient JSON format input frontend, and either Gurobi or Cplex as the linear program solving engine, is implemented and open-sourced.


Introduction
One of the most distinguishing features of information theory is its ability to provide fundamental limits for various communication and computation systems, which may be extremely difficult, if not impossible, to establish otherwise. There is a set of well-known information inequalities, such as the non-negativity of mutual information and conditional mutual information, which are guaranteed to hold simply due to the basic mathematical properties of the information measures, such as entropy and conditional mutual information. Fundamental limits of various information systems can be obtained by combining these inequalities strategically, and the universality of the information measures implies that fundamental limits of diverse information systems can be derived in a general manner.
Conventionally, the proofs for such fundamental limits are hand-crafted and written as a chain of inequalities, where each individual step is one of the aforementioned known information inequalities, or certain equalities and inequalities implied by the specific problem settings. As information systems become more and more complex, such manual efforts have become increasingly unwieldy, and computer-aided approaches naturally emerge as possible alternatives. A computer-aided approach can be particularly attractive and productive during the stage of initial problem exploration, and when the complexity of the system prevents an effective bound from being constructed manually.
The entropic linear programming (LP) framework [1] was the first major step toward this direction; however, since the resultant LPs are usually very large, a direct adoption limits its applicability to simple problem settings, typically with no greater than ten random variables. In several recent works [2][3][4][5][6][7], led by the first author of the current work, it was shown that reductions based on problem-specific symmetry and dependence relations can be used to make the problems more manageable. In this work, we further develop this research direction. First, we adopt an efficient data structure, namely the disjoint-set [8], to improve the efficiency of the aforementioned reduction. Then, we consider and develop four techniques to investigate the fundamental limits of information systems: 1) computing a bound for a given linear combination of information measures and providing the values of information measures at the optimal solution; 2) efficiently computing a polytope tradeoff outer bound between two information quantities; 3) producing a proof (as a weighted sum of known information inequalities) for a computed outer bound; and 4) providing the range for information quantities between which the optimal value does not change (sensitivity analysis). To improve the utility of the approach, an efficient JSON input format is provided, and a toolbox, using either Cplex [9] or Gurobi [10] as the linear program solving engine, is implemented and open-sourced [11].

Literature Review
In a pioneering work, Yeung [1] pointed out and demonstrated how a linear programming (LP) framework can be used to computationally verify whether an information inequality involving Shannon's information measures is true or not, or more precisely, whether it can be proved using a general set of known information inequalities, which have since become known as Shannon-type inequalities. A MATLAB implementation based on this connection, called the information theory inequality prover (ITIP) [12], was made available online at the same time. A subsequent effort by another group (XITIP [13]) replaced the MATLAB LP solver with a more efficient open source LP solver and also introduced a more user-friendly interface. Later on, a new version of ITIP also adopted a more efficient LP solver to improve the computation efficiency. ITIP and XITIP played important roles in the study of non-Shannon-type inequalities and Markov random fields [14][15][16].
Despite its considerable impact, ITIP is a generic inequality prover, and utilizing it on any specific coding problem can be a daunting task. It can also fail to provide meaningful results due to the associated computation cost. Instead of using the LP to verify a hypothesized inequality, a more desirable approach is to use a computational approach on the specific problem of interest to directly find the fundamental limits and, moreover, to utilize the inherent problem structure in reducing the computation burden. This was the approach taken on several problems of recent interest, such as distributed storage, coded caching, and private information retrieval [2][3][4][5][6][7], and it was shown to be rather effective.
One key difference in the aforementioned line of work, compared to several other efforts in the literature, is the following. Since most information theoretic problems of practical relevance or current interests induce a quite large LP instance, considerable effort was given to reducing the number of LP variables and the number of LP constraints algorithmically, before the LP solver is even invoked. Particularly, problem-specific symmetry and dependence have been used explicitly for this purpose, instead of the standard approach of leaving them for the LP solver to eliminate. This approach allows the program to handle larger problems than ITIP can, which has yielded meaningful results on problems of current interest. Moreover, through LP duality, it has been demonstrated in Reference [2] that human-readable proofs can be generated by taking advantage of the dual LP. This approach of generating proofs has been adopted and extended by several other works [17,18].
From a more theoretical perspective, a minimum set of LP constraints under problem-specific dependence was fully characterized in Reference [19], and the problem of counting the number of LP variables and constraints after applying problem-specific symmetry relations was considered in Reference [20]. However, these results do not lead to any algorithmic advantage, since the former relies on a set of relationship tests that are algorithmically expensive to complete, and the latter provides a method of counting instead of enumerating these information inequalities.
Li et al. used a similar computational approach to tackle the multilevel diversity coding problem [17] and multi-source network coding problems with simple network topology [21] (also see Reference [22]); however, the main focus was to provide an efficient enumeration and classification of the large number of specific small instances (all instances considered require 7 or fewer random variables), where each instance itself poses little computation issue. Beyond computing outer bounds, the problem of computationally generating inner bounds was also explored [23,24].
Recently, Ho et al. [18] revisited the problem of using the LP framework for verifying the validity of information inequalities and proposed a method to computationally disprove certain information inequalities. Moreover, it was shown that the alternating direction method of multipliers (ADMM) can be used to speed up the LP computation. In a different application of the LP framework [25], Gattegno et al. used it to improve the efficiency of the Fourier-Motzkin elimination procedure often encountered in information theoretic studies of multiterminal coding problems. In another generalization of the approach, Gurpinar and Romashchenko used the computational approach in an extended probability space such that information inequalities beyond the Shannon type may become active [26].

Information Inequalities and Entropic LP
In this section, we provide the background and a brief review of the entropic linear program framework. Readers are referred to References [27][28][29] for more details.

Information Inequalities
The most well-known information inequalities are based on the non-negativity of the conditional entropy and mutual information, which are

H(X 1 | X 2 ) ≥ 0, I(X 1 ; X 2 | X 3 ) ≥ 0, (1)

where the single random variables X 1 , X 2 , and X 3 can be replaced by sets of random variables. A very large number of inequalities can be written this way when the problem involves a total of n random variables X 1 , X 2 , . . . , X n . Within the set of all information inequalities in the form shown in (1), many are implied by others. There are also other information inequalities implied by the basic mathematical properties of the information measures but not in these forms or directly implied by them, which are usually referred to as non-Shannon-type inequalities. Non-Shannon-type inequalities are notoriously difficult to enumerate and utilize [30][31][32][33]. In practice, almost all bounds for the fundamental limits of information systems have been derived using only Shannon-type inequalities.

The Entropic LP Formulation
Suppose we express all the relevant quantities in a particular information system (a coding problem) as random variables (X 1 , X 2 , . . . , X n ), e.g., X 1 is an information source, and X 3 is its encoded version at a given point in the system. In this case, the derivation of a fundamental limit in an information system or a communication system may be understood conceptually as the following optimization problem:

minimize: a weighted sum of certain joint entropies
subject to: (I) generic constraints that any information measures must satisfy;
(II) problem-specific constraints on the information measures,

where the variables in this optimization problem are all the information measures on the random variables X 1 , X 2 , . . . , X n that we can write down in this problem. For example, if H(X 2 , X 3 ) is a certain quantity that we wish to minimize (e.g., the total amount of the compressed information in the system), then the solution of the optimization problem with H(X 2 , X 3 ) being the objective function will provide the fundamental limit of this quantity (e.g., the lowest amount we can compress the information to).
The first observation is that the variables in the optimization problem may be restricted to all possible joint entropies. In other words, there are 2^n − 1 variables of the form H(X A ), where A ⊆ {1, 2, . . . , n} is non-empty. We do not need to include conditional entropy, mutual information, or conditional mutual information, because they may be written simply as linear combinations of the joint entropies.
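As a small illustration of this variable reduction, the sketch below (a hypothetical helper in plain Python, not part of the paper's toolbox) indexes the joint entropies by non-empty subset bitmasks and expands a conditional mutual information into its joint-entropy coefficients.

```python
# Illustrative sketch (not the toolbox code): index the 2^n - 1 joint
# entropies by non-empty subset bitmasks, and expand a (conditional)
# mutual information I(X_A; X_B | X_C) = H(AC) + H(BC) - H(ABC) - H(C)
# into coefficients over the joint-entropy LP variables.

def cmi_coefficients(a, b, c=0):
    """Return {subset_bitmask: coefficient} for I(X_a; X_b | X_c).

    a, b, c are bitmasks over the random variables; c may be 0, in
    which case the H(C) term is absent and this is plain I(X_a; X_b).
    """
    coeffs = {}

    def add(mask, w):
        if mask:  # H(empty set) = 0, so the empty subset is skipped
            coeffs[mask] = coeffs.get(mask, 0) + w

    add(a | c, 1)       # +H(A, C)
    add(b | c, 1)       # +H(B, C)
    add(a | b | c, -1)  # -H(A, B, C)
    add(c, -1)          # -H(C)
    return {m: w for m, w in coeffs.items() if w != 0}

# I(X1; X2 | X3): bitmask 0b001 is {X1}, 0b010 is {X2}, 0b100 is {X3}.
print(cmi_coefficients(0b001, 0b010, 0b100))
# -> {5: 1, 6: 1, 7: -1, 4: -1}, i.e., H(1,3) + H(2,3) - H(1,2,3) - H(3)
```

Every inequality and equality discussed below can be expanded into such a coefficient vector over the joint-entropy variables.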
Next, let us focus on the two classes of constraints. To obtain a good (hopefully tight) bound, we wish to include all the Shannon-type inequalities as generic constraints in the first group of constraints. However, enumerating all of them is not the best approach since, as we have mentioned earlier, there are redundant inequalities that are implied by others. Yeung identified a minimal set of constraints, which are called elemental inequalities [1,28]:

H(X i | X {1,2,...,n}\{i} ) ≥ 0, i = 1, 2, . . . , n, (2)
I(X i ; X j | X K ) ≥ 0, i ≠ j, K ⊆ {1, 2, . . . , n} \ {i, j}. (3)

Note that both (2) and (3) can be written as linear constraints in terms of joint entropies. It is straightforward to see that there are n + (n choose 2) · 2^{n−2} elemental inequalities. These are the generic constraints that we will use in group (I).
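The elemental inequalities are easy to enumerate programmatically. The following sketch (a hypothetical helper, not the toolbox code) generates them and verifies the count stated above.

```python
# Hypothetical enumeration helper (not the toolbox code): generate the
# elemental inequalities for n random variables and verify their count.
from itertools import combinations
from math import comb

def elemental_inequalities(n):
    """Return the elemental inequalities as tagged tuples; each can be
    expanded into a linear constraint on the 2^n - 1 joint entropies."""
    full = set(range(1, n + 1))
    ineqs = []
    # Type 1: H(X_i | X_{all others}) >= 0, one per random variable.
    for i in sorted(full):
        ineqs.append(("H", i, frozenset(full - {i})))
    # Type 2: I(X_i; X_j | X_K) >= 0 for i < j and K a subset of the rest.
    for i, j in combinations(sorted(full), 2):
        rest = sorted(full - {i, j})
        for r in range(len(rest) + 1):
            for K in combinations(rest, r):
                ineqs.append(("I", i, j, frozenset(K)))
    return ineqs

# n + (n choose 2) * 2^(n-2) inequalities in total; e.g., 28 for n = 4.
assert len(elemental_inequalities(4)) == 4 + comb(4, 2) * 2 ** 2
```

The quadratic-times-exponential growth in n is exactly why the reductions discussed later are essential for problems of practical size.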
The second group of constraints are the problem-specific constraints. These are usually the implication relations required by the system or the specific coding requirements. For example, if X 4 is a coded representation of X 1 and X 2 , then this relation can be represented as

H(X 4 | X 1 , X 2 ) = 0, i.e., H(X 1 , X 2 , X 4 ) = H(X 1 , X 2 ), (4)

which is a linear constraint. This group of constraints may also include independence and conditional independence relations. For example, if X 1 , X 3 , X 7 are three mutually independent sources, then this relation can be represented as

H(X 1 , X 3 , X 7 ) = H(X 1 ) + H(X 3 ) + H(X 7 ), (5)

which is also a linear constraint.
In the examples in later sections, we will provide these constraints more specifically. The two groups of constraints are both linear in terms of the optimization problem variables, i.e., the 2 n − 1 joint entropies (defined on the n random variables); thus, we have a linear program (LP) at hand.

Symmetry and Dependence Relations
In this section, we discuss two relations that can help reduce the complexity of the entropic LP, without which many information systems or coding problems of practical interest appear too complex to be solved in the entropic LP formulation. To be more specific, we first introduce two working examples that will be used throughout this paper to illustrate the main idea.

Two Examples
The two example problems are the regenerating code problem and the coded caching problem: • The (n, k, d) regenerating code problem [34,35] is depicted in Figure 1. It considers the situation where a message is stored in a distributed manner in n nodes, each having capacity α (Figure 1a). Two coding requirements need to be satisfied: 1) the message can be recovered from any k nodes (Figure 1b), and 2) any single node can be repaired by downloading β amount of information from any d of the other nodes (Figure 1c). The fundamental limit of interest is the optimal tradeoff between the storage cost α and the download cost β. We will use the (n, k, d) = (4, 3, 3) case as our working example.

Figure 1. (a) Encoding; (b) Recovery; (c) Repair.

In this setting, the stored contents are W 1 , W 2 , W 3 , W 4 , and the repair message sent from node i to repair node j is denoted as S i,j . In this case, the set of the random variables in the problem is

(W 1 , W 2 , W 3 , W 4 , S i,j : i, j ∈ {1, 2, 3, 4}, i ≠ j).

Some readers may notice that we do not include a random variable to represent the original message stored in the system. This is because it can be equivalently viewed as the collection of (W 1 , W 2 , W 3 , W 4 ) and can, thus, be omitted in this formulation. • The (N, K) coded caching problem [36] considers the situation where a server, which holds a total of N mutually independent files of unit size each, serves a set of K users, each with a local cache of size M. The users can prefetch some content (Figure 2a), but when they reveal their requests (Figure 2b), the server must calculate and multicast a common message of size R (Figure 2c). The requests are not revealed to the server beforehand, and the prefetching must be designed to handle all cases. The fundamental limit of interest is the optimal tradeoff between the cache capacity M and the transmission size R. In this setting, the messages are denoted as (W 1 , W 2 , . . . , W N ), the prefetched contents as (Z 1 , Z 2 , . . . , Z K ), and the transmission when the users request (d 1 , d 2 , . . . , d K ) is written as X d 1 ,d 2 ,...,d K . We will use the case (N, K) = (2, 3) as our second running example in the sequel, and, in this case, the random variables in the problem are

(W 1 , W 2 , Z 1 , Z 2 , Z 3 , X d 1 ,d 2 ,d 3 : d 1 , d 2 , d 3 ∈ {1, 2}).

The Dependency Relation
The dependency (or implication) relation, e.g., the one given in (4), can be included in the optimization problem in different ways. The first option, which is the simplest, is to include these equality constraints directly as constraints of the LP. There is, however, another method. Observe that, since the two entropy values are equal, we can simply represent them using the same LP variable, instead of generating two different LP variables and then insisting that they are of the same value. This helps reduce the number of LP variables in the problem. In our two working examples, the dependence relations are as follows. • The regenerating code problem: each node can compute the repair messages it sends from its stored content, and a failed node is recovered from the messages it receives, i.e.,

H(S i,j : j ≠ i | W i ) = 0, i = 1, 2, 3, 4,
H(W j | S i,j : i ≠ j) = 0, j = 1, 2, 3, 4. (6)

The first equality (for i = 1) implies that

H(W 1 , S 1,2 , S 1,3 , S 1,4 ) = H(W 1 ),

and we can alternatively write both sides using the same LP variable. Other dependence relations can be converted similarly. This dependence structure can also be represented as a graph, as shown in Figure 3. In this graph, a given node (random variable) is a function of the random variables with edges incoming to it. • The caching problem: the prefetched contents and the delivery transmissions are all functions of the messages, i.e.,

H(Z k | W 1 , W 2 ) = 0, k = 1, 2, 3,
H(X d 1 ,d 2 ,d 3 | W 1 , W 2 ) = 0, d 1 , d 2 , d 3 ∈ {1, 2}.
Two remarks are now in order: • For the purpose of deriving outer bounds, it is valid to ignore the symmetry relation altogether, or consider only part of the symmetry relation, as long as the remaining permutations still form a group. For example, in the caching problem, if we only consider the symmetry induced by exchanging the two messages, then we have the first 2 rows instead of the full 12 rows of permutations. Omitting some permutations means less reduction in the LP scale but does not invalidate the computed bounds. • Admittedly, representing the symmetry relation using the above permutation representation is not the most concise approach, and there exists mathematically precise and concise language to specify such structure. We choose this permutation approach because of its simplicity and universality and, perhaps more importantly, due to its suitability for software implementation.

Reducing the Problem Algorithmically via the Disjoint-Set Data Structure
In this section, we first introduce the equivalence relation and classification of joint entropies and then introduce the disjoint-set data structure to identify the classification in an algorithmic manner.
Mathematically, the dependence and the symmetry jointly induce an equivalence relation, and we wish to identify the classification based on this equivalence relation. The key to efficiently forming the reduced LP is to identify the mapping from any subset of random variables to the index of the equivalence class it belongs to, i.e.,

f : 2^{1,2,...,n} → {1, 2, . . . , N*}, (13)

where N* is the total number of equivalence classes so induced. In terms of software implementation, the mapping f assigns any subset of the n random variables in the problem to an index, which also serves as the index of the variable in the linear program. More precisely, this mapping provides the fundamental reduction mechanism in the LP formulation: an elemental constraint of the form (2) or (3), written as a linear combination of joint entropies H(X A ), becomes an inequality on the variables Y f(A) in the resultant LP, where the Y's are the variables in the LP and each joint entropy H(X A ) is replaced by the variable Y f(A) of its equivalence class.

Difficulty in Identifying the Reduction
Following the discussion above, each given subset A ⊆ {1, 2, . . . , n} belongs to an equivalence class of subsets, and an arbitrary element in the equivalence class can be designated (and fixed) as the leader of this class. To efficiently complete the classification task, we need to be able to find, for each given subset A, the leader of the equivalence class this subset belongs to. In the example given above, this step is reasonably straightforward. Complications arise when multiple reduction steps are required. To see this, let us consider the set {S 1,3 , S 2,3 , S 4,3 , S 2,1 , S 4,1 }. By the dependence relation {S 1,3 , S 2,3 , S 4,3 } → W 3 , we know

H(S 1,3 , S 2,3 , S 4,3 , S 2,1 , S 4,1 ) = H(W 3 , S 1,3 , S 2,3 , S 4,3 , S 2,1 , S 4,1 ).

Since S 3,1 is a function of W 3 , i.e., W 3 → S 3,1 , this further equals

H(W 3 , S 1,3 , S 2,3 , S 4,3 , S 2,1 , S 3,1 , S 4,1 ) = H(W 1 , W 3 , S 1,3 , S 2,3 , S 4,3 , S 2,1 , S 3,1 , S 4,1 ),

where the last equality is due to the dependence relation {S 2,1 , S 3,1 , S 4,1 } → W 1 . In this process, we have applied three different dependence relations in a particular order. In a computer program, this implies that we need to iterate over all the dependence relations in the problem to apply the appropriate one and then repeat the process until no further dependence relation can be applied. To make things worse, the symmetry relation would need to be taken into account: for example, we will also need to consider how to recognize one subset to be a permuted version of another subset, as well as whether to do so before or after applying the dependence relation. A naive implementation to find the mapping function f will be highly inefficient.

Disjoint-Set Data Structure and Algorithmic Reduction
The aforementioned difficulty can be resolved using a suitable data structure, namely disjoint-set [8]. A disjoint-set data structure is also called a union-find structure, and, as its name suggests, it stores a collection of disjoint sets. The most well known method to accomplish this task is through a disjoint-set forest [8], which can perform the union operation in constant time, and the find operation (find for an element the index, or the leading element, of the set that it belongs to) in near constant amortized time.
Roughly speaking, the disjoint-set forest in our setting starts with each subset of random variables A ⊆ {1, 2, . . . , n} viewed as its own disjoint set and assigned an index; clearly, we will have a total of 2^n − 1 singleton sets at initialization. We iterate through each symmetry permutation and dependence relation as follows: • Symmetry step: For each singleton set (which corresponds to a subset A ⊆ {1, 2, . . . , n}) in the disjoint-set structure, consider each permutation in the symmetry relation: if the permutation maps A into another element (which corresponds to another subset of random variables A′ ⊆ {1, 2, . . . , n}) not already in the same set in the disjoint-set structure, then we combine the two sets by forming their union. • Dependence step: For each existing set in the disjoint-set structure, consider each dependence relation: if the set leader (which corresponds to a subset A ⊆ {1, 2, . . . , n}) is equivalent to another element due to the given dependence (which corresponds to another subset of random variables A′ ⊆ {1, 2, . . . , n}) not already in the same set, then we combine the two sets by forming their union.
The key for the efficiency of this data structure is that the union operation is done through pointers, instead of physical memory copy. Moreover, inherent in the data structure is a tree representation of each set; thus, finding the leader index is equivalent to finding the tree root, which is much more efficient than a linear search. The data structure is maintained dynamically during union and find operations, and the height of a tree will be reduced (compressed) when a find operation is performed or when the tree becomes too high.
Clearly, due to the usage of this data structure, the dependence relations do not need to be exhaustively listed, because permuted versions of a dependence relation are accounted for automatically. For example, in the regenerating code problem, including only two dependence relations suffices when they are used jointly with the symmetry relations:

W 1 → {S 1,2 , S 1,3 , S 1,4 }, {S 2,1 , S 3,1 , S 4,1 } → W 1 .

This replaces the 8 dependence relations given in (6). In the context of our setting, after the disjoint-set forest has been built through both the symmetry step and the dependence step, another enumeration step is performed to generate the mapping function f(·), which can be done in O(2^n) time. In practice, we observe that this data structure can provide considerable speedup (sometimes up to 50-fold), though the precise speedup factor depends on the problem-specific dependence and symmetry relations.
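The reduction described above can be sketched as follows. This is a simplified, hypothetical Python rendition (the toolbox's actual implementation differs): a disjoint-set forest over all non-empty subsets, a symmetry step, a dependence step iterated to a fixed point, and a final enumeration producing f. It is run on a toy three-variable problem rather than either working example.

```python
# Simplified rendition of the reduction (the toolbox differs in detail).
from itertools import combinations

class DisjointSet:
    """Union-find forest with path compression and union by rank."""

    def __init__(self, elements):
        self.parent = {e: e for e in elements}
        self.rank = {e: 0 for e in elements}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # compress path
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

def reduce_entropies(n, perms, deps):
    """perms: dicts mapping variable index -> permuted index (symmetry
    generators); deps: pairs (source_set, implied_var) meaning the
    variables in source_set jointly determine implied_var.
    Returns the mapping f: subset -> equivalence-class index."""
    subsets = [frozenset(s) for r in range(1, n + 1)
               for s in combinations(range(1, n + 1), r)]
    ds = DisjointSet(subsets)
    # Symmetry step: merge each subset with its permuted images.
    for a in subsets:
        for p in perms:
            ds.union(a, frozenset(p[i] for i in a))
    # Dependence step: H(A) = H(A + {t}) whenever A contains a source
    # set determining t; repeat until no further merge applies, so the
    # chained reductions of the example above happen automatically.
    changed = True
    while changed:
        changed = False
        for a in subsets:
            for src, t in deps:
                if src <= a and t not in a:
                    b = a | {t}
                    if ds.find(a) != ds.find(b):
                        ds.union(a, b)
                        changed = True
    # Enumeration step: assign one LP-variable index per class leader.
    leaders = sorted({ds.find(a) for a in subsets}, key=sorted)
    index = {l: i + 1 for i, l in enumerate(leaders)}
    return {a: index[ds.find(a)] for a in subsets}

# Toy problem (hypothetical): X3 is a function of (X1, X2), and the
# problem is symmetric under swapping X1 and X2.
f = reduce_entropies(3, perms=[{1: 2, 2: 1, 3: 3}],
                     deps=[(frozenset({1, 2}), 3)])
assert f[frozenset({1})] == f[frozenset({2})]           # symmetry merge
assert f[frozenset({1, 2})] == f[frozenset({1, 2, 3})]  # dependence merge
assert len(set(f.values())) == 4                        # 7 subsets -> 4 classes
```

In the toy problem, the seven joint-entropy variables collapse to four LP variables; for the working examples the reduction is far more dramatic.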

Four Investigative Techniques
In this section, we introduce four investigative techniques to study fundamental limits of information systems. With the efficient reduction discussed above, these methods are rather powerful tools in such information theoretic studies.

Bounding Plane Optimization and Queries
In this case, the objective function is fixed, and the optimal solution gives an outer bound on a specific linear combination of several information measures or relevant quantities. Figure 4 illustrates the method, where we wish to find a lower bound in the given direction for the optimal tradeoff shown in red. Let us again consider the two working examples.

• If the simple sum of the storage cost α and repair cost β, i.e., α + β, needs to be lower-bounded in the regenerating code problem, we can let the objective function be

H(W 1 ) + H(S 1,2 ),

and then minimize it. The optimal value will be a lower bound, which in this case turns out to be 5/8. Note that, by taking advantage of the symmetry, the objective function set up above indeed specifies the sum rate of any storage and repair transmission. • If we wish to lower-bound the simple sum of memory and rate in the coded caching problem, the situation is somewhat subtle. Note that the rate R is lower-bounded by both H(X 1,1,1 ) and H(X 1,2,2 ); however, the symmetry relation does not imply that H(X 1,1,1 ) = H(X 1,2,2 ). For this case, we can introduce an additional LP variable R and add the constraints

R ≥ H(X 1,1,1 ), R ≥ H(X 1,2,2 ).

We then set the objective function to be

H(Z 1 ) + R,

from which the minimum value is a lower bound on the simple sum of memory and rate in this setting.
In addition to simply computing the supporting hyperplane, it is important to extract useful information from the optimal solution. Particularly, we may wish to probe for the values of certain information measures in the optimal solution. For example, in the case above for coded caching, we may be interested in the value of I(Z 1 ; W 1 ), which essentially reveals the amount of information regarding W 1 that is stored in Z 1 in an uncoded form.

Tradeoff and Convex Hull Computation
In many cases, instead of bounding a fixed rate combination, we are interested in the tradeoff of several quantities, most frequently the optimal tradeoff between two quantities; see Figure 4 again for an illustration. The two working examples both belong to this case.
Since the constrained set in the LP is a polytope, the resulting outer bound to the optimal tradeoff will be a piece-wise linear bound. A naive strategy is to trace the boundary by sampling points on a sufficiently dense grid; however, this approach is time-consuming and inaccurate. Instead, observe that the calculation of this piece-wise linear outer bound is equivalent to computing the projection of a convex polytope, and Lassez's algorithm provides an efficient method to complete this task. We implemented Lassez's algorithm for the projection onto two-dimensional space in this toolbox. A more detailed description of this algorithm can be found in Reference [37], and the specialization used in the program can be found in Reference [4].
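The idea behind the two-dimensional projection can be sketched as follows. The solve oracle stands in for one LP solve with a weighted objective; the vertex list here is hypothetical, chosen for illustration only, and is not the feasible region of either working example.

```python
# Sketch of tracing a two-dimensional piece-wise linear bound through
# supporting-line queries. Each call to solve(w) stands in for one LP
# solve with objective w[0]*alpha + w[1]*beta; here it scans a
# hypothetical vertex list instead.

def support(vertices, w):
    """One 'LP solve': the vertex minimizing w . x."""
    return min(vertices, key=lambda v: w[0] * v[0] + w[1] * v[1])

def lower_hull(solve, left, right, eps=1e-9):
    """Recursively query supporting lines to find all corner points of
    the lower-left boundary between two known extreme points."""
    # Normal of the segment left-right pointing into the lower-left.
    w = (left[1] - right[1], right[0] - left[0])
    p = solve(w)
    # If no feasible point lies strictly below the segment, the two
    # endpoints are adjacent corners of the projection.
    if w[0] * p[0] + w[1] * p[1] >= w[0] * left[0] + w[1] * left[1] - eps:
        return [left, right]
    return (lower_hull(solve, left, p, eps)
            + lower_hull(solve, p, right, eps)[1:])

verts = [(1.0, 0.0), (0.5, 0.25), (0.375, 0.5), (0.0, 2.0), (1.0, 2.0)]
solve = lambda w: support(verts, w)
left = solve((1, 1e-6))   # corner with minimum alpha
right = solve((1e-6, 1))  # corner with minimum beta
print(lower_hull(solve, left, right))
# -> [(0.0, 2.0), (0.375, 0.5), (0.5, 0.25), (1.0, 0.0)]
```

Each query that is not a termination test discovers a new corner, so the number of LP solves grows linearly with the number of corners, instead of with the density of a sampling grid.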

Duality and Computer-generated Proof
After identifying a valid outer bound, we sometimes wish to find a proof for this bound. In fact, even if the bound is not optimal, or it is only a hypothesized bound, we may still attempt to prove it. For example, in the regenerating code problem, we have the bound

α + β ≥ 5/8.

How can we prove this inequality? It is clear from LP duality that this inequality is a weighted sum of the individual constraints in the LP. Thus, as long as we find one such weighted sum, we can then write down a chain of inequalities directly by combining these constraints one by one; for a more detailed discussion, see References [2,4,17,18].

Sensitivity Analysis
At the computed optimal value, we can probe for the range of certain information measures such that forcing them into these ranges does not change the value of the optimal solution. Consider the quantity I(Z 1 ; W 1 ) in the caching problem. It may be possible for it to take any value in the range [0.2, 0.4] without changing the optimal value of the original optimization problem. On the other hand, if it can only take the value 0.2, then this suggests that, if a code achieving this optimal value indeed exists, it must have this amount of uncoded information regarding W 1 stored in Z 1 . This information can be valuable in reverse-engineering optimal codes; see Reference [4] for a discussion of such usage.
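In principle, the range is obtained by minimizing and maximizing the quantity of interest over the optimal face of the LP. The toy sketch below illustrates this on a hypothetical polytope given explicitly by its vertices (the numbers are chosen only to mirror the [0.2, 0.4] example above); the toolbox instead performs additional LP solves.

```python
def sensitivity_range(vertices, c, q, eps=1e-9):
    """Among all optimal solutions of min c.x over a polytope (given
    here by an explicit vertex list, standing in for the entropic LP),
    return the range the quantity q.x can take while keeping the
    objective at its optimal value."""
    dot = lambda w, v: sum(wi * vi for wi, vi in zip(w, v))
    opt = min(dot(c, v) for v in vertices)
    # Optimal face: the vertices achieving the optimal objective value;
    # a linear q attains its extremes over this face at its vertices.
    face = [v for v in vertices if dot(c, v) <= opt + eps]
    vals = [dot(q, v) for v in face]
    return min(vals), max(vals)

# Hypothetical feasible region with a non-unique optimum: the whole
# segment between (0.2, 0.3) and (0.4, 0.1) attains c.x = 0.5.
verts = [(0.2, 0.3), (0.4, 0.1), (1.0, 0.0), (0.0, 1.0)]
print(sensitivity_range(verts, c=(1, 1), q=(1, 0)))  # -> (0.2, 0.4)
```

A degenerate range (min equal to max) corresponds to the rigid situation described above, where every optimal code must exhibit the same value of the probed quantity.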

JSON Problem Descriptions
In the implemented toolbox, the program reads a problem description file (a plain text file), and the desired computed bounds or proofs are produced without further user intervention. In our work, significant effort has been invested in designing an efficient input format; after a few iterations, a JSON-based format was selected, which considerably improves the usability and extendibility of the toolbox. In this section, we provide an example problem description, from which the syntax is mostly self-evident. More details can be found in the documentation accompanying the software [11]. An input problem description file must include the characters PD (which stand for "problem description"), followed by a JSON object detailing the problem description.

Keys in PD JSON
The program solves a minimization problem, i.e., it finds a lower bound for a certain information quantity. There are a total of 12 JSON keys allowed in the problem description: RV, AL, O, D, I, S, BC, BP, QU, SE, CMD, and OPT.
These stand for random variables, additional LP variables, objective function, dependence, independence, symmetry, bound-constant, bound-to-prove, query, sensitivity, command, and options, respectively. For the precise syntax, the readers are referred to the toolbox user manual. We next provide a simple example to illustrate the usage of this toolbox, from which these keywords are self-evident.

An Example Problem Description File
Below is a sample PD file for the regenerating code problem we previously discussed.
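As an illustration, a PD file for the (4, 3, 3) regenerating code problem might look roughly as follows. This is a hypothetical sketch only: the key names follow the list in the previous section, but the value syntax shown here is an assumption for illustration, and the authoritative format is the one described in the toolbox manual [11].

```json
PD
{
  "RV": ["W1", "W2", "W3", "W4",
         "S12", "S13", "S14", "S21", "S23", "S24",
         "S31", "S32", "S34", "S41", "S42", "S43"],
  "AL": ["A", "B"],
  "O": "A + B",
  "D": [
    {"given": ["W1"], "determined": ["S12", "S13", "S14"]},
    {"given": ["S21", "S31", "S41"], "determined": ["W1"]}
  ],
  "S": [
    ["W2", "W1", "W3", "W4", "S21", "S23", "S24", "S12", "S13", "S14",
     "S32", "S31", "S34", "S42", "S41", "S43"]
  ],
  "BC": ["A >= H(W1)", "B >= H(S12)"],
  "QU": ["H(W1)", "I(W1;W2)"]
}
```

Note how the D key needs to list only the two representative dependence relations, as discussed earlier, and the S key lists permutations of the random variables as rows (only the generator exchanging nodes 1 and 2 is shown here).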
The queried quantities are also shown in the output, and it can be seen that (α, β) in the LP optimal solution are (0.375, 0.25), together with the values of two other information measures.
The toolbox can also identify the tradeoff between α and β, for which the output is as follows. Here, the three (α, β) pairs are the corner points of the lower convex hull of the tradeoff. To prove the stated inequality, the toolbox can generate the proof in two different forms; the latter sometimes yields a more concise proof, though not in this case. In practice, it may be preferable to perform only one of them to reduce the overall computation.

Conclusions
In this work, we considered computational techniques to investigate fundamental limits of information systems. The disjoint-set data structure was adopted to identify the equivalence class mapping in an algorithmic manner, which is much more efficient than a naive linear enumeration. We provide an open source toolbox for four computational techniques. A JSON format frontend allows the toolbox to read a problem description file, convert it to the corresponding LP, and then produce meaningful bounds and other results directly without user intervention.