The Design of Global Correlation Quantifiers and Continuous Notions of Statistical Sufficiency

Using first principles from inference, we design a set of functionals for the purposes of ranking joint probability distributions with respect to their correlations. Starting with a general functional, we impose its desired behavior through the Principle of Constant Correlations (PCC), which constrains the correlation functional to behave in a consistent way under statistically independent inferential transformations. The PCC guides us in choosing the appropriate design criteria for constructing the desired functionals. Since the derivations depend on a choice of partitioning the variable space into n disjoint subspaces, the general functional we design is the n-partite information (NPI), of which the total correlation and mutual information are special cases. Thus, these functionals are found to be uniquely capable of determining whether a certain class of inferential transformations, ρ −*→ ρ', preserve, destroy or create correlations. This provides conceptual clarity by ruling out other possible global correlation quantifiers. Finally, the derivation and results allow us to quantify non-binary notions of statistical sufficiency. Our results express what percentage of the correlations are preserved under a given inferential transformation or variable mapping.


Introduction
The goal of this paper is to quantify the notion of global correlations as it pertains to inductive inference. This is achieved by designing a set of functionals from first principles to rank entire probability distributions ρ according to their correlations. Because correlations are relationships defined between different subspaces of propositions (variables), the ranking of any distribution ρ, and hence the type of correlation functional one arrives at, depends on the particular choice of "split" or partitioning of the variable space. Each choice of "split" produces a unique functional for quantifying global correlations, which we call the n-partite information (NPI).
The term correlation may be defined colloquially as being a relation between two or more "things". While we have a sense of what correlations are, how do we quantify this notion more precisely? If correlations have to do with "things" in the real world, are correlations themselves "real"? Can correlations be "physical"? One is forced to address similar questions in the context of designing the relative entropy as a tool for updating probability distributions in the presence of new information (e.g., "What is information?") [1]. In the context of inference, correlations are broadly defined as being statistical relationships between propositions. In this paper we adopt the view that whatever correlations may be, their effect is to influence our beliefs about the natural world. Thus, they are interpreted as the information which constitutes statistical dependency. With this identification, the natural setting for the discussion becomes inductive inference.
When one has incomplete information, the tools one must use for reasoning objectively are probabilities [1,2]. The relationships between different propositions x and y are quantified by a joint probability density, p(x, y) = p(x|y)p(y) = p(x)p(y|x), where the conditional distribution p(y|x) quantifies what one should believe about y given information about x, and vice-versa for p(x|y). Intuitively, correlations should have something to do with these conditional dependencies.
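A minimal numerical sketch (not from the paper; the joint table is illustrative) of the product rule p(x, y) = p(x)p(y|x) = p(y)p(x|y) for a small discrete joint distribution:

```python
import numpy as np

# A small discrete joint distribution p(x, y) over 2 x-values and 3 y-values.
# (Illustrative numbers only; any normalized non-negative table works.)
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y = p_xy.sum(axis=0)             # marginal p(y)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y|x)
p_x_given_y = p_xy / p_y[None, :]  # conditional p(x|y)

# Both factorizations reproduce the same joint: p(x,y) = p(x)p(y|x) = p(y)p(x|y)
assert np.allclose(p_x[:, None] * p_y_given_x, p_xy)
assert np.allclose(p_y[None, :] * p_x_given_y, p_xy)
```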
In this paper, we seek to quantify a global amount of correlation for an entire probability distribution. That is, we desire a scalar functional I[ρ] for the purpose of ranking distributions ρ according to their correlations. Such functionals are not unique since many examples, e.g., covariance, correlation coefficient [3], distance correlation [4], mutual information [5], total correlation [6], maximal-information coefficient [7], etc., measure correlations in different ways. What we desire is a principled approach to designing a family of measures I[ρ] according to specific design criteria [8-10].
The idea of designing a functional for ranking probability distributions was first discussed in Skilling [9]. In his paper, Skilling designs the relative entropy as a tool for ranking posterior distributions, ρ, with respect to a prior, ϕ, in the presence of new information that comes in the form of constraints (15) (see Section 2.1.3 for details). The ability of the relative entropy to provide a ranking of posterior distributions allows one to choose the posterior that is closest to the prior while still incorporating the new information that is provided by the constraints. Thus, one can choose to update the prior in the most minimalist way possible. This feature is part of the overall objectivity that is incorporated into the design of relative entropy, and in later versions it is stated as the guiding principle [11-13].
Like relative entropy, we desire a method for ranking joint distributions with respect to their correlations. Whatever value our desired quantifier I[ρ] gives for a particular distribution ρ, we expect that if we change ρ through some generic transformation (*), ρ −*→ ρ' = ρ + δρ, then our quantifier also changes, I[ρ] → I[ρ'] = I[ρ] + δI, and that this change of I[ρ] reflects the change in the correlations, i.e., if ρ changes in a way that increases the correlations, then I[ρ] should also increase. Thus, our quantifier should be an increasing functional of the correlations, i.e., it should provide a ranking of ρ's.
The type of correlation functional I[ρ] one arrives at depends on a choice of the splits within the proposition space X, and thus the functional we seek is I[ρ] → I[ρ, X]. For example, if one has a proposition space X = X_1 × · · · × X_N, consisting of N variables, then one must specify which correlations the functional I[ρ, X] should quantify. Do we wish to quantify how the variable X_1 is correlated with the other N − 1 variables? Or do we want to study the correlations between all of the variables? In our design derivation, each of these questions represents an extremal case of the family of quantifiers I[ρ, X], the former being a bi-partite correlation (or mutual information) functional and the latter being a total correlation functional.
In the main design derivation we will focus on the case of total correlation, which is designed to quantify the correlations between every variable subspace X_i in a set of variables X = X_1 × · · · × X_N. We suggest a set of design criteria (DC) for the purpose of designing such a tool. These DC are guided by the Principle of Constant Correlations (PCC), which states that "the amount of correlations in ρ should not change unless required by the transformation, (ρ, X) −*→ (ρ', X')." This implies our design derivation requires us to study equivalence classes [ρ] within statistical manifolds ∆ under the various transformations of distributions ρ that are typically performed in inference tasks. We will find, according to our design criteria, that the global quantifier of correlations we desire in this special case is equivalent to the total correlation [6].
Once one arrives at the TC as the solution to the design problem in this article, one can then derive special cases such as the mutual information [5] or, as we will call them, any n-partite information (NPI), which measures the correlations shared between generic n-partitions of the proposition space. The NPI and the mutual information (or bi-partite information) can be derived using the same principles as the TC except with one modification, as we will discuss in Section 5.
The special case of NPI when n = 2 is the bipartite (or mutual) information, which quantifies the amount of correlations present between two subsets of some proposition space X. Mutual information (MI) as a measure of correlation has a long history, beginning with Shannon's seminal work on communication theory [14], in which he first defines it. While Shannon provided arguments for the functional form of his entropy [14], he did not provide a derivation of the MI. Since then, there has still been no principled approach to the design of the MI or of the total correlation TC. Recently however, there has been interest in characterizing entropy through a category theoretic approach (see the works of Baez et al. [15]). The approach by Baez et al. shows that a particular class of functors from the category FinStat, whose objects are finite sets equipped with probability distributions, are scalar multiples of the entropy [15]. The papers by Baudot et al. [16-18] also take a category theoretic approach; however, their results are more focused on the topological properties of information theoretic quantities. Both Baez et al. and Baudot et al. discuss various information theoretic measures such as the relative entropy, mutual information, total correlation, and others.
The idea of designing a tool for the purpose of inference and information theory is not new. Beginning in [2], Cox showed that probabilities are the functions that are designed to quantify "reasonable expectation" [19], which Jaynes [20] and Caticha [10] have since refined as "degrees of rational belief". Inspired by the method of maximum entropy [20-22], there have been many improvements on the derivation of entropy as a tool designed for the purpose of updating probability distributions in the decades since Shannon [14]. Most notable are those by Shore and Johnson [8], Skilling [9], Caticha [11], and Vanslette [12,13]. The entropy functionals in [11-13] are designed to follow the Principle of Minimal Updating (PMU), which states, for the purpose of enforcing objectivity, that "a probability distribution should only be updated to the extent required by the new information." In these articles, information is defined operationally (*) as that which induces the updating of the probability distributions, ϕ −*→ ρ.

An important consequence of deriving the various NPI as tools for ranking is their immediate application to the notion of statistical sufficiency. Sufficiency is a concept that dates back to Fisher, and some would argue Laplace [23], both of whom were interested in finding statistics that contain all relevant information about a sample. Such statistics are called sufficient; however, this notion is only a binary label, so it does not quantify an amount of sufficiency. Using the result of our design derivation, we can propose a new definition of sufficiency in terms of a normalized NPI. Such a quantity gives a sense of how close a set of functions is to being sufficient statistics. This topic will be discussed in Section 6.
In Section 2 we will lay out some mathematical preliminaries and discuss the general transformations in statistical manifolds we are interested in. Then in Section 3, we will state and discuss the design criteria used to derive the functional form of the TC and the NPI in general. In Section 4 we will complete the proof of the results from Section 3. In Section 5 we discuss the n-partite (NPI) special cases of the TC, of which the bipartite case is the mutual information, discussed in Section 5.2. In Section 6 we will discuss sufficiency and its relation to the Neyman-Pearson lemma [24]. It should be noted that throughout this article we will be using a probabilistic framework in which x ∈ X denotes propositions of a probability distribution, rather than a statistical framework in which x denotes random numbers.

Mathematical Preliminaries
The arena of any inference task consists of two ingredients, the first of which is the subject matter, or what is often called the universe of discourse. This refers to the actual propositions that one is interested in making inferences about. Propositions tend to come in two classes: discrete or continuous. Discrete proposition spaces will be denoted by calligraphic uppercase Latin letters, X, and the individual propositions will be lowercase Latin letters x_i ∈ X, indexed by some variable i ∈ {1, . . . , |X|}, where |X| is the number of distinct propositions in X. In this paper we will mostly work in the context of continuous propositions, whose spaces will be denoted by bold faced uppercase Latin letters, X, and whose elements will simply be lowercase Latin letters with no indices, x ∈ X. Continuous proposition spaces have a much richer structure than discrete spaces (due to the existence of various differentiable structures, the ability to integrate, etc.) and help to generalize concepts such as relative entropy and information geometry [10,25,26]. (Common examples of discrete proposition spaces are the results of a coin flip or a toss of a die, while an example of a continuous proposition space is the position of a particle [27].)
The second ingredient that one needs to define for general inference tasks is the space of models, i.e., the space of probability distributions which one wishes to assign to the underlying proposition space. These spaces can often be given the structure of a manifold, which in the literature is called a statistical manifold [10]. A statistical manifold ∆ is a manifold in which each point ρ ∈ ∆ is an entire probability distribution, i.e., ∆ is a space of maps from subsets of X to the interval [0, 1], ρ : P(X) → [0, 1]. The notation P(X) denotes the power set of X, which is the set of all subsets of X and has cardinality |P(X)| = 2^|X|.
In the simplest cases, when the underlying propositions are discrete, the manifold is finite dimensional. A common example used in the literature is the three-sided die, whose distribution is determined by three probability values ρ = {p_1, p_2, p_3}. Due to positivity, p_i ≥ 0, and the normalization constraint, ∑_i p_i = 1, the point ρ lives in the 2-simplex. Likewise, a generic discrete statistical manifold with n possible states is an (n − 1)-simplex. In the continuum limit, which is often the case explored in physics, the statistical manifold becomes infinite dimensional and is defined as

∆ = { p(x) : p(x) ≥ 0, ∫_X dx p(x) = 1 }. (1)

(Throughout the rest of the paper, we use the Greek ρ to represent a generic distribution in ∆, and we use the Latin p(x) to refer to an individual density.) When the statistical manifold is parameterized by the densities p(x), the zeroes always lie on the boundary of the simplex. In this representation the statistical manifolds have a trivial topology; they are all simply connected. Without loss of generality, we assume that the statistical manifolds we are interested in can be represented as (1), so that ∆ is simply connected and does not contain any holes. The space ∆ in this representation is also smooth.
The symbol ρ defines what we call a state of knowledge about the underlying propositions X. It is, in essence, the quantification of our degrees of belief about each of the possible propositions x ∈ X [2]. The correlations present in any distribution ρ necessarily depend on the conditional relationships between various propositions. For instance, consider the binary case of just two proposition spaces X and Y, so that the joint distribution factors,

p(x, y) = p(x)p(y|x) = p(y)p(x|y). (2)

The correlations present in p(x, y) will necessarily depend on the form of p(x|y) and p(y|x), since the conditional relationships tell us how one variable is statistically dependent on the other. As we will see, the correlations defined in Equation (2) are quantified by the mutual information. For situations of many variables however, the global correlations are defined by the total correlation, which we will design first. All other measures which break up the joint space into conditional distributions (including (2)) are special cases of the total correlation.
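As a hedged illustration of how the conditional structure in (2) carries correlations, the following sketch computes the mutual information of a discrete joint table (natural log, as used throughout the paper); the distributions are illustrative, not taken from the text:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) = sum_xy p(x,y) log[ p(x,y) / (p(x)p(y)) ], natural log."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independent joint: p(x,y) = p(x)p(y), so the conditionals carry no
# information and the correlations vanish.
p_indep = np.outer([0.3, 0.7], [0.4, 0.6])
# Perfectly correlated joint: knowing x completely fixes y.
p_corr = np.array([[0.5, 0.0],
                   [0.0, 0.5]])

print(mutual_information(p_indep))  # ~0.0
print(mutual_information(p_corr))   # log 2 ≈ 0.693
```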

Some Classes of Inferential Transformations
There are four main types of transformations that one can enact on a state of knowledge, ρ −*→ ρ'. They are: coordinate transformations, entropic updating (this of course includes Bayes' rule as a special case [28,29]), marginalization, and products. This set of transformations is not necessarily exhaustive, but it is sufficient for our discussion in this paper. We will indicate whether or not each of these types of transformations can presumably cause changes to the amount of global correlations by evaluating the response of the statistical manifold under these transformations. Our inability to describe how much the amount of correlations changes under these transformations motivates the design of an objective global quantifier.
The types of transformations we will explore can be identified either with maps from a particular statistical manifold to itself, ∆ → ∆ (type I), to a subset of the original manifold, ∆ → ∆' ⊆ ∆ (type II), or from one statistical manifold to another, ∆ → ∆' (types III and IV).

Type I: Coordinate Transformations
Type I transformations are coordinate transformations. A coordinate transformation f : X → X' is a special type of transformation of the proposition space X that respects certain properties. It is essentially a continuous version of a reparameterization. (A reparameterization is an isomorphism between discrete proposition spaces, g : X → Y, which identifies for each proposition x_i ∈ X a unique proposition y_i ∈ Y, so that the map g is a bijection.) For one, each proposition x ∈ X must be identified with one and only one proposition x' ∈ X' and vice versa. This means that coordinate transformations must be bijections on proposition space. The reason for this is simply by design, i.e., we would like to study the transformations that leave the proposition space invariant. A general transformation of type I on ∆ which takes X to X' = f(X) is met with the following transformation of the densities,

p(x) −I→ p'(x') where p(x) dx = p'(x') dx'. (3)

As we already mentioned, the coordinate transforming function f : X → X' must be a bijection in order for (3) to hold, i.e., the map f must be invertible. While the densities p(x) and p'(x') are not necessarily equal, the probabilities defined in (3) must be (according to the rules of probability theory; see Appendix A). This indicates that ρ −I→ ρ' = ρ is in the same location in the statistical manifold. That is, the global state of knowledge has not changed; what has changed is the way in which the local information in ρ is expressed, which must be invertible in general.
While one could impose that the transformations f be diffeomorphisms (i.e., smooth maps between X and X'), it is not necessary that we restrict f in this way. Without loss of generality, we only assume that the bijections f ∈ C^0(X) are continuous. For discussions involving diffeomorphism invariance and statistical manifolds, see the works of Amari [25], Ay et al. [30] and Bauer et al. [31].
For a coordinate transformation (3) involving two variables, x ∈ X and y ∈ Y, type I transformations give,

p(x, y) −I→ p'(x', y') where p(x, y) dx dy = p'(x', y') dx' dy'.
A few general properties of these type I transformations are as follows. First, the density p(x, y) can be expressed in terms of the density p'(x', y'),

p(x, y) = γ(x', y') p'(x', y'),

where γ(x', y') is the determinant of the Jacobian [10] that defines the transformation,

γ(x', y') = |det ∂(x', y')/∂(x, y)|.

For a finite number of variables x = (x_1, . . . , x_N), the general type I transformations p(x_1, . . . , x_N) −I→ p'(x'_1, . . . , x'_N) are written,

p(x_1, . . . , x_N) = γ(x') p'(x'_1, . . . , x'_N),

and the Jacobian becomes,

γ(x') = |det(∂x'_i/∂x_j)|.

One can also express the density p'(x') in terms of the original density p(x) by using the inverse transform,

p'(x') = γ^(-1)(x') p(f^(-1)(x')).

In general, since coordinate transformations preserve the probabilities associated to a joint proposition space, they also preserve several structures derived from them. One of these is the Fisher-Rao (information) metric [25,31,32], which was proved by Čencov [26] to be the unique metric on statistical manifolds that represents the fact that the points ρ ∈ ∆ are probability distributions and not structureless [10]. (For a summary of various derivations of the information metric, see [10], Section 7.4.)
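The change-of-variables rule p(x)dx = p'(x')dx' can be checked numerically. The sketch below (illustrative density and map, not from the paper) uses x' = f(x) = x^3 on the density p(x) = 2x over [0, 1] and verifies that corresponding events carry equal probability:

```python
import numpy as np

# Numerical check of p(x)dx = p'(x')dx' under x' = f(x) = x**3 (illustrative).
# With p(x) = 2x on [0, 1], the transformed density is
#   p'(x') = p(x)/|f'(x)|  evaluated at  x = x'**(1/3).
def p(x):
    return 2.0 * x

def p_prime(xp):
    x = xp ** (1.0 / 3.0)
    return p(x) / (3.0 * x ** 2)   # Jacobian f'(x) = 3x**2

def trapezoid(y, x):               # simple composite trapezoid rule
    return float(((y[1:] + y[:-1]) * np.diff(x)).sum() / 2.0)

# Corresponding events carry equal probability:
# P(0.2 < x < 0.8) = P(0.2**3 < x' < 0.8**3) = 0.8**2 - 0.2**2 = 0.6
x = np.linspace(0.2, 0.8, 100001)
xp = np.linspace(0.2 ** 3, 0.8 ** 3, 100001)
P_x = trapezoid(p(x), x)
P_xp = trapezoid(p_prime(xp), xp)
print(P_x, P_xp)   # both ≈ 0.6
```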

Split Invariant Coordinate Transformations
Consider a class of coordinate transformations that result in a diagonal Jacobian matrix, i.e.,

∂x'_i/∂x_j = 0 for i ≠ j, so that x'_i = f_i(x_i) for each i. (10)

These transformations act within each of the variable spaces independently, and hence they are guaranteed to preserve the definition of the split between any n-partitions of the propositions, and because they are coordinate transformations, they are invertible and do not change our state of knowledge, (ρ, X) −Ia→ (ρ', X') = (ρ, X). We call such special types of transformations (10) split invariant coordinate transformations and denote them as type Ia. From (10), it is obvious that the marginal distributions of ρ are preserved under split invariant coordinate transformations,

p(x_i) dx_i = p'(x'_i) dx'_i, ∀ i ∈ {1, . . . , N}. (11)

If one allows generic coordinate transformations of the joint space, then the marginal distributions may depend on variables outside of their original split. Thus, if one redefines the split after a coordinate transformation to new variables X → X', the original problem statement changes as to what variables we are considering correlations between, and thus Equation (11) no longer holds. This is apparent in the case of two variables (x, y), where x' = f_x(x, y), since the new marginal,

p'(x') = ∫ dy' p'(x', y'), (12)

mixes in information about y through f_x. In the situation where x and y are independent, redefining the split after the coordinate transformation (12) breaks the original independence, since the distribution that originally factors, p(x, y) = p(x)p(y), would be made to have conditional dependence in the new coordinates, i.e., if x' = f_x(x, y) and y' = f_y(x, y), then in general,

p'(x', y') ≠ p'(x') p'(y').

So, even though the above transformation satisfies (3), this type of transformation may change the correlations in ρ by allowing for the potential redefinition of the split X → X'. Hence, when designing our functional, we identify split invariant coordinate transformations as those which preserve correlations. These restricted coordinate transformations help isolate a single functional form for our global correlation quantifier.
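A discrete analogue may make the distinction concrete: relabeling each variable's outcomes separately (the discrete counterpart of a split invariant transformation) leaves the mutual information unchanged, while a relabeling that mixes cells across the split can change it. The table and permutations below are illustrative assumptions, not from the paper:

```python
import numpy as np

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

# "Split invariant" relabeling: permute x-outcomes and y-outcomes separately
# (the discrete analogue of x' = f_x(x), y' = f_y(y)); correlations unchanged.
p_relabel = p_xy[::-1, :][:, ::-1]
assert np.isclose(mutual_information(p_relabel), mutual_information(p_xy))

# A relabeling of joint cells that crosses the split (mixes the two
# subspaces) can change the correlations:
p_mixed = p_xy.reshape(-1)[[0, 3, 2, 1]].reshape(2, 2)
print(mutual_information(p_xy), mutual_information(p_mixed))
```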

Type II: Entropic Updating
Type II transformations are those induced by updating [10], ϕ −*→ ρ, in which one maximizes the relative entropy,

S[ρ, ϕ] = −∫ dx p(x) log (p(x)/q(x)), (14)

subject to constraints and relative to the prior, q(x). Constraints often come in the form of expectation values [10,20-22],

⟨f(x)⟩ = ∫ dx p(x) f(x) = κ. (15)

A special case of these transformations is Bayes' rule [28,29]. In (14) and throughout the rest of the paper we will use log base e (natural log) for all logarithms, although the results are perfectly well defined for any base (the quantities S[ρ, ϕ] and I[ρ, X] will simply differ by an overall scale factor when using different bases). Maximizing (14) with respect to constraints such as (15) induces a jump in the statistical manifold. Type II transformations, while well defined, are not necessarily continuous, since in general one can map nearby points to disjoint subsets in ∆. Type II transformations will also cause ρ −II→ ρ' ≠ ρ in general, as the distribution jumps within the statistical manifold. This means, because different ρ's may have different correlations, that type II transformations can either increase, decrease, or leave the correlations invariant.
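A sketch of type II updating under a single expectation-value constraint (the classic mean-constrained die; the bisection solver and tolerances are implementation choices, not from the paper). The maximizer of the relative entropy takes the Gibbs form p_i ∝ q_i exp(−λ f_i), with λ fixed by the constraint:

```python
import numpy as np

# Maximize S[p, q] = -sum_i p_i log(p_i/q_i) subject to <f> = kappa.
q = np.full(6, 1.0 / 6.0)          # prior: fair six-sided die
f = np.arange(1, 7, dtype=float)   # f(x) = face value
kappa = 4.5                        # constrained mean

def posterior(lam):
    w = q * np.exp(-lam * f)       # Gibbs form p_i ∝ q_i exp(-lam f_i)
    p = w / w.sum()
    return p, float(p @ f)

# The constrained mean is monotonically decreasing in lam, so the
# multiplier can be found by bisection.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    _, m = posterior(mid)
    if m > kappa:
        lo = mid                   # mean too high -> increase lam
    else:
        hi = mid

p, m = posterior(0.5 * (lo + hi))
print(p, m)   # distribution tilted toward high faces, mean ≈ 4.5
```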

Type III: Marginalization
Type III transformations are induced by marginalization,

p(x) = ∫ dy p(x, y),

which is effectively a quotienting of the statistical manifold, ∆(x) = ∆(x, y)/∼_y, i.e., for any point p(x), we equivocate all values of p(y|x). Since the distribution ρ changes under type III transformations, ρ −III→ ρ', the amount of correlations can change.
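In the discrete case the marginalization integral becomes a sum over the quotiented variable; a two-line sketch (illustrative joint table):

```python
import numpy as np

# Type III in the discrete case: p(x) = sum_y p(x, y).  The conditional
# information p(y|x), and with it any x-y correlation, is discarded.
p_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])
p_x = p_xy.sum(axis=1)
print(p_x)   # [0.4 0.6]
```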

Type IV: Products
Type IV transformations are created by products,

p(x) −IV→ p(x, y) = p(x) p(y|x),

which are a kind of inverse transformation of type III, i.e., the set of propositions X becomes the product X × Y. There are many different situations that can arise from this type, the most trivial being an embedding,

p(x) −IVa→ p(x, y) = p(x) δ(y − y_0),

which can be useful in many applications. The function δ(·) in the above equation is the Dirac delta function [33], which has the properties δ(y − y_0) = 0 for y ≠ y_0 and ∫ dy δ(y − y_0) = 1. We will denote such a transformation as type IVa. Another trivial example of type IV is the independent product,

p(x) −IVb→ p(x, y) = p(x) p(y),

which we will call type IVb. Like type II, generic transformations of type IV can potentially create correlations, since again we are changing the underlying distribution.
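The following sketch (illustrative distributions) shows discrete counterparts of types IVa and IVb: the independent product creates a larger joint space but no correlations, and the embedding concentrates the new variable on a single value:

```python
import numpy as np

# Type IVb sketch: the product p(x, y) = p(x) p(y) enlarges the proposition
# space but creates no correlations; the conditional p(y|x) is just p(y).
p_x = np.array([0.2, 0.8])
p_y = np.array([0.5, 0.3, 0.2])
p_xy = np.outer(p_x, p_y)                 # X -> X x Y
p_y_given_x = p_xy / p_x[:, None]
assert np.allclose(p_y_given_x, p_y)      # knowing x tells us nothing about y

# Type IVa sketch (embedding): the discrete analogue of p(x)δ(y - y0)
# puts all y-probability on a single value y0 (here the first y-outcome).
p_embed = np.zeros((2, 3))
p_embed[:, 0] = p_x
assert np.isclose(p_embed.sum(), 1.0)
```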

Remarks on Inferential Transformations
There are many practical applications in inference which make use of the above transformations by combining them in a particular order. For example, in machine learning and dimensionality reduction, the task is often to find a low-dimensional representation of some proposition space X, which is done by combining types I, III and IVa in the order,

ρ −IVa→ ρ' −I→ ρ'' −III→ ρ'''.

Neural networks are a prime example of this sequence of transformations [34]. Another example of IV, I, III transformations are convolutions of probability distributions, which take two proposition spaces and combine them into a new one [5].
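Convolution as a product-transform-marginalize pipeline can be sketched discretely (the coin-valued distributions are illustrative): form the product p(x)p(y), reindex by z = x + y, then marginalize out the remaining variable:

```python
import numpy as np

# Convolution as a IV, I, III pipeline: the distribution of z = x + y is
# obtained by (IV) forming the product p(x)p(y), (I) changing coordinates
# to (z, y) = (x + y, y), then (III) marginalizing y out.
p_x = np.array([0.5, 0.5])          # e.g., a fair coin scored 0/1
p_y = np.array([0.25, 0.5, 0.25])   # e.g., the sum of two such coins

joint = np.outer(p_x, p_y)          # type IV: product distribution
p_z = np.zeros(len(p_x) + len(p_y) - 1)
for i in range(len(p_x)):           # type I: reindex cells by z = i + j,
    for j in range(len(p_y)):       # type III: sum out the remaining index
        p_z[i + j] += joint[i, j]

assert np.allclose(p_z, np.convolve(p_x, p_y))
print(p_z)  # [0.125 0.375 0.375 0.125]
```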
In Appendix C we discuss how our resulting design functionals behave under the aforementioned transformations.

Designing a Global Correlation Quantifier
In this section we seek to achieve our design goal for the special case of the total correlation.

Design Goal: Given a space of N variables X = X_1 × · · · × X_N and a statistical manifold ∆, we seek to design a functional I[ρ, X] which ranks distributions ρ ∈ ∆ according to their total amount of correlations.
Unlike deriving a functional, designing a functional is done through the process of eliminative induction. Derivations are simply a means of showing consistency with a proposed solution, whereas design is much deeper. In designing a functional, the solution is not assumed but rather achieved by specifying design criteria that restrict the functional form in a way that leads to a unique or optimal solution. One can then interpret the solution in terms of the original design goal. Thus, by looking at the "nail", we design a "hammer", and conclude that hammers are designed to knock in and remove nails. We will show that there are several paths to the solution of our design criteria, the proof of which is in Section 4.
Our design goal requires that I[ρ, X] be scalar valued so that we can rank the distributions ρ according to their correlations. We therefore consider a functional defined over the continuous space which depends on each of the possible probability values for every x ∈ X,

I[ρ, X] = I[p(x), ∀ x ∈ X; X]. (22)

(In Watanabe's paper [6], the total correlation between a set of variables λ = λ_1 × · · · × λ_N is written as

C_tot(λ) = ∑_i S(λ_i) − S(λ),

where S(λ_i) is the Shannon entropy of the subspace λ_i ⊆ λ. For a proof of Watanabe's theorem see Appendix B.)
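Watanabe's form of the total correlation is easy to evaluate for discrete joints; the sketch below (illustrative distributions, natural log) checks the two extreme cases of three perfectly correlated bits and three independent bits:

```python
import numpy as np

# Watanabe's form of the total correlation for a discrete joint:
#   TC = sum_i S(X_i) - S(X_1, ..., X_N), with S the Shannon entropy (nats).
def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def total_correlation(p_joint):
    marginals = [p_joint.sum(axis=tuple(j for j in range(p_joint.ndim)
                                        if j != i))
                 for i in range(p_joint.ndim)]
    return sum(entropy(m) for m in marginals) - entropy(p_joint.ravel())

# Three perfectly correlated fair bits: x1 = x2 = x3.
p = np.zeros((2, 2, 2))
p[0, 0, 0] = p[1, 1, 1] = 0.5
print(total_correlation(p))   # 3 log 2 - log 2 = 2 log 2 ≈ 1.386

# Three independent fair bits: TC = 0.
p_ind = np.full((2, 2, 2), 1.0 / 8.0)
assert np.isclose(total_correlation(p_ind), 0.0)
```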
Given the types of transformations that may be enacted on ρ, we state the main guiding principle we will use to meet our design goal:

Principle of Constant Correlations (PCC): The amount of correlations in ρ should not change unless required by the transformation, (ρ, X) −*→ (ρ', X').
While simple, the PCC is incredibly constraining. By stating when one should not change the correlations, i.e., when I[ρ, X] should remain constant, the PCC ensures that if we are able to complete our design goal, then we will be able to uniquely quantify how transformations of types I-IV affect the amount of correlations in ρ.
The discussion of type I transformations indicates that split invariant coordinate transformations do not change (ρ, X). This is because we want to not only maintain the relationships within the joint distribution (3), but also the relationships among the marginal spaces,

p(x_i) dx_i = p'(x'_i) dx'_i, ∀ i ∈ {1, . . . , N}.

Only then are the relationships between the n-partitions guaranteed to remain fixed, and hence the distribution ρ remains in the same location in the statistical manifold. When a coordinate transformation of this type is made, because it does not change (ρ, X), we are not explicitly required to change I[ρ, X], so by the PCC we impose that it does not change.
The PCC together with the design goal implies the following:

Corollary 1 (Split Coordinate Invariance). The coordinate systems within a particular split are no more informative about the amount of correlations than any other coordinate system for a given ρ.
This expression is somewhat analogous to the statement that "coordinates carry no information", which is usually stated as a design criterion for relative entropy [8,9,11]. (This appears as axiom two in Shore and Johnson's derivation of relative entropy [8], which is stated on page 27 as "II. Invariance: The choice of coordinate system should not matter." In Skilling's approach [9], which was mainly concerned with image analysis, axiom two on page 177 is justified with the statement "We expect the same answer when we solve the same problem in two different coordinate systems, in that the reconstructed images in the two systems should be related by the coordinate transformation." Finally, in Caticha's approach [11], the axiom of coordinate invariance is simply stated on page 4 as "Criterion 2: Coordinate invariance. The system of coordinates carries no information.")
To specify the functional form of I[ρ, X] further, we will appeal to special cases in which it is apparent that the PCC should be imposed [9]. The first involves local, subdomain, transformations of ρ. If a subdomain of X is transformed, then one may be required to change its amount of correlations by some specified amount. Through the PCC however, there is no explicit requirement to change the amount of correlations outside of this domain; hence we impose that those correlations outside are not changed. The second special case involves transformations of an independent subsystem. If a transformation is made on an independent subsystem, then again by the PCC, because there is no explicit reason to change the amount of correlations in the other subsystem, we impose that they are not changed. We denote these two types of transformation independences as our two design criteria (DC).
Surprisingly, the PCC and the DC are enough to determine a general form for I[ρ, X] (up to an irrelevant scale constant). As previously stated, the first design criterion concerns local changes in the probability distribution ρ.

Design Criterion 1 (Locality). Local transformations of ρ contribute locally to the total amount of correlations.
The term locality has been invoked to mean many different things in different fields (e.g., physics, statistics, etc.). In this paper, as well as in [8,9,11-13], the term local refers to transformations which are constrained to act only within a particular subdomain D ⊂ X, i.e., the transformations of the probabilities are local to D and do not affect probabilities outside of this domain. Essentially, if new information does not require us to change the correlations in a particular subdomain D ⊂ X, then we do not change the probabilities over that subdomain. While simple, this criterion is incredibly constraining and restricts the general form (22) to the functional form,

I[ρ, X] = ∫ dx F(p(x), x),

where F is some undetermined function of the probabilities and possibly the coordinates. We have used dx = dx_1 . . . dx_N to denote the measure for brevity. To constrain F further, we first use the corollary of split coordinate invariance (Corollary 1) among the subspaces X_i ⊂ X and then apply special cases of particular coordinate transformations. This leads to the following functional form,

I[ρ, X] = ∫ dx p(x) Φ( p(x) / ∏_i p(x_i) ),

which demonstrates that the integrand is independent of the actual coordinates themselves. Like coordinate invariance, the axiom DC1 also appears in the design derivations of relative entropy [8,9,11-13]. (In Shore and Johnson's approach to relative entropy [8], axiom four is analogous to our locality criterion, which states on page 27: "IV. Subset Independence: It should not matter whether one treats an independent subset of system states in terms of a separate conditional density or in terms of the full system density." In Skilling's approach [9], locality appears as axiom one which, like Shore and Johnson's axiom, is called Subset Independence and is justified with the following statement on page 175: "Information about one domain should not affect the reconstruction in a different domain, provided there is no constraint directly linking the domains." In Caticha [11] the axiom is also called Locality and is written on page four as "Criterion 1: Locality. Local information has local effects." Finally, in Vanslette's work [12,13], the subset independence criterion is stated on page three as follows: "Subdomain Independence: When information is received about one set of propositions, it should not effect or change the state of knowledge (probability distribution) of the other propositions (else information was also received about them too).") This leaves the function Φ to be determined, which can be done by imposing an additional design criterion.
Design Criterion 2 (Subsystem Independence). Transformations of ρ in one independent subsystem can only change the amount of correlations in that subsystem.
The consequence of DC2 concerns independence among subspaces of X. Given two subsystems (X_1 × X_2) × (X_3 × X_4) ≡ X_12 × X_34 = X which are independent, the joint distribution factors, p(x) = p(x_1, x_2)p(x_3, x_4). We will see that this leads to the global correlations being additive over each subsystem, I[ρ, X] = I[ρ_12, X_12] + I[ρ_34, X_34]. Like locality (DC1), the design criterion concerning subsystem independence appears in all four approaches to relative entropy [8,9,11-13]. (In Shore and Johnson's approach [8], axiom three concerns subsystem independence and is stated on page 27 as "III. System Independence: It should not matter whether one accounts for independent information about independent systems separately in terms of different densities or together in terms of a joint density." In Skilling's approach [9], the axiom concerning subsystem independence is given by axiom three on page 179, with the following comment on page 180 about its consequences: "This is the crucial axiom, which reduces S to the entropic form. The basic point is that when we seek an uncorrelated image from marginal data in two (or more) dimensions, we need to multiply the marginal distributions. On the other hand, the variational equation tells us to add constraints through their Lagrange multipliers. Hence the gradient δS/δf must be the logarithm." In Caticha's design derivation [11], axiom three concerns subsystem independence and is written on page 5 as "Criterion 3: Independence. When systems are known to be independent it should not matter whether they are treated separately or jointly." Finally, in Vanslette [12,13] on page 3 we have "Subsystem Independence: When two systems are a priori believed to be independent and we only receive information about one, then the state of knowledge of the other system remains unchanged.") However, due to the difference in the design goal here, we end up imposing DC2 in a way closer to that of [12,13], as we do not explicitly have the Lagrange multiplier structure in our design space.
Imposing DC2 leads to the final functional form of I[ρ, X], namely I[ρ, X] = ∫ dx p(x) log [ p(x) / ∏_{i=1}^N p(x_i) ], with p(x_i) being the split-dependent marginals. This functional is what is typically referred to as the total correlation (the concept of total correlation (TC) was first introduced by Watanabe [6] as a generalization of Shannon's definition of mutual information; there are many practical applications of TC in the literature [35-38]) and is the unique result obtained from imposing the PCC and the corresponding design criteria.
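As a concrete check of this ranking behavior, the total correlation can be evaluated directly for discrete joint distributions. The following sketch is our own illustration (the function name and test distributions are not from the paper); it computes TC as the relative entropy between a joint pmf and the product of its one-variable marginals:

```python
import numpy as np

def total_correlation(p):
    """TC of a joint pmf stored as an N-dimensional array:
    sum_x p(x) * log[ p(x) / prod_i p(x_i) ]."""
    p = np.asarray(p, dtype=float)
    n = p.ndim
    prod = np.ones_like(p)
    for axis in range(n):
        # Marginal over all axes except `axis`, broadcast back to p's shape.
        marg = p.sum(axis=tuple(a for a in range(n) if a != axis))
        shape = [1] * n
        shape[axis] = p.shape[axis]
        prod = prod * marg.reshape(shape)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / prod[mask])))

# Perfectly correlated bits rank highest (TC = log 2); an independent
# product distribution sits at the design-goal minimum (TC = 0).
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])
independent = np.outer([0.5, 0.5], [0.3, 0.7])
```

For the correlated pair the value is log 2 ≈ 0.693, while any product distribution returns the minimum value of zero, as the design goal requires.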
As was mentioned throughout, these results are usually implemented as design criteria for relative entropy as well. Shore and Johnson's approach [8] presents four axioms, of which III and IV are subsystem and subset independence. Subset independence in their framework corresponds to Equation (24) and to the Locality axiom of Caticha [11]. It also appears as an axiom in the approaches by Skilling [9] and Vanslette [12,13]. Subsystem independence is given by axiom three in Caticha's work [11], axiom two in Vanslette's [12,13] and axiom three in Skilling's [9]. While coordinate invariance was invoked in the approaches by Skilling, Shore and Johnson, and Caticha, it was later found to be unnecessary in the work by Vanslette [12,13], who required only two axioms. Likewise, we find that it is an obvious consequence of the PCC and does not need to be stated as a separate axiom in our derivation of the total correlation.
The work by Csiszár [39] provides a nice summary of the various axioms used by many authors (including Aczél [40], Shore and Johnson [8] and Jaynes [21]) in their definitions of information theoretic measures. (A list is given on page 3 of [39] which includes the following conditions on an entropy function H(P): (1) Positivity (H(P) ≥ 0), (2) Expansibility ("expansion" of P by a new component equal to 0 does not change H(P), i.e., embedding in a space in which the probabilities of the new propositions are zero), (3) Symmetry (H(P) is invariant under permutations of the probabilities), (4) Continuity (H(P) is a continuous function of P), ..., and (9) Sum property (H(P) = ∑_{i=1}^n g(p_i) for some function g).) One could associate the design criteria in this work to some of the common axioms enumerated in [39], although some of them will appear as consequences of imposing a specific design criterion, rather than as an ansatz. For example, the strong additivity condition (see Appendices C.1.3 and C.1.4) is the result of imposing DC1 and DC2. Likewise, the conditions of positivity (i.e., I[ρ, X] ≥ 0) and convexity occur as a consequence of the design goal, split coordinate invariance (SCI) and both of the design criteria. Continuity of I[ρ, X] with respect to ρ is imposed through the design goal, and symmetry is a consequence of DC1. In summary: Design Goal → continuity, DC1 → symmetry, (DC1 + DC2) → strong additivity, (Design Goal + SCI + DC1 + DC2) → positivity + convexity. As was shown by Shannon [14] and others [39,40], various combinations of these axioms, as well as the ones mentioned in footnote 11, are enough to characterize entropic measures.
One could argue that we could have merely imposed these axioms at the outset to achieve the functional I[ρ, X], rather than arriving at it through the PCC and the corresponding design criteria. The point of this article, however, is to design the correlation functionals by using principles of inference, rather than by imposing conditions on the functional directly (this point was also discussed in the conclusion of Shore and Johnson [8], page 33). In this way, the resulting functionals are consequences of employing the inference framework, rather than postulated arbitrarily.
One will recognize that the functional form of (28) and the corresponding n-partite informations (88) have the form of a relative entropy. Indeed, if one identifies the product marginal ∏_{i=1}^N p(x_i) as a prior distribution as in (14), then it may be possible to find constraints (15) which update the product marginal to the desired joint distribution p(x). One can then interpret the constraints as the generators of the correlations. We leave the exploration of this topic to a future publication.

Proof of the Main Result
We will prove the results summarized in the previous section. Let a proposition of interest be represented by x_i ∈ X, an N-dimensional coordinate x_i = (x_i^1, . . ., x_i^N) that lives somewhere in the discrete and fixed proposition space X = {x_1, . . ., x_i, . . ., x_|X|}, with |X| being the cardinality of X (i.e., the number of possible combinations). The joint probability distribution at this generic location is P(x_i) ≡ P(x_i^1, . . ., x_i^N), and the entire distribution ρ is the set of joint probabilities defined over the space X, i.e., ρ ≡ {P(x_1), . . ., P(x_|X|)} ∈ ∆.

Locality-DC1
We begin by imposing DC1 on I[ρ, X]. Consider changes in ρ induced by some transformation (∗), where the change to the state of knowledge is ρ → ρ′ = ρ + δρ, for some arbitrary change δρ in ∆ that is required by some new information. This implies that the global correlation function must also change according to (22), where δI is the change to I[ρ, X] induced by (29) for small changes δP(x_d). In general, the derivatives in (31) are functions of the probabilities, which could potentially depend on the entire distribution ρ as well as on the point x_d ∈ X. We impose DC1 by constraining (32) to depend only on the probabilities within the subdomain D, since the variation (32) should not cause changes to the amount of correlations in the complement of D. This condition must also hold for arbitrary choices of subdomain D; thus, imposing DC1 in the most restrictive case of local changes (D = x_d) guarantees that it will hold in general. In this most restrictive case, the functional I[ρ, X] has vanishing mixed derivatives, δ²I/(δP(x_d) δP(x_d′)) = 0 for x_d ≠ x_d′. Integrating (34) leads to a sum of independent contributions, I[ρ, X] = ∑_i F_i(P(x_i), x_i) + constant, where the {F_i} are undetermined functions of the probabilities and the coordinates. As this functional is designed for ranking, nothing prevents us from setting the irrelevant constant to zero, which we do.
Extending to the continuum, we find Equation (24), where for brevity we have also condensed the notation for the continuous N-dimensional variables x = {x_1, . . ., x_N}. It should be noted that F(p(x), x) has the capacity to express a large variety of potential measures of correlation, including Pearson's [3] and Székely's [4] correlation coefficients. Our objective now is to use eliminative induction until only a unique functional form for F remains.

Split Coordinate Invariance-PCC
The PCC and the corollary (1) state that I[ρ, X], and thus F(p(x), x), should be independent of transformations that keep (ρ, X) ∗→ (ρ′, X′) = (ρ, X) fixed. As discussed, split invariant coordinate transformations (10) satisfy this property. We will further restrict the functional I[ρ, X] so that it is invariant under these types of transformations.
We can always rewrite the expression (37) by introducing densities m(x) and p(x). Then, instead of dealing with the function F directly, we can deal with a new function Φ, defined accordingly. Now we further restrict the functional form of Φ by appealing to the PCC. Consider the functional I[ρ, X] under a split invariant coordinate transformation, (x_1, . . ., x_N) → (x′_1, . . ., x′_N) ⇒ m(x)dx = m′(x′)dx′ and p(x)dx = p′(x′)dx′, (41) which amounts to transforming Φ accordingly, where γ(x) = ∏_{i=1}^N γ_i(x_i) is the Jacobian for the transformation from (x_1, . . ., x_N) to (f_1(x_1), . . ., f_N(x_N)). Consider the special case in which the Jacobian γ(x) = 1. Then, due to the PCC, Φ must be unchanged. Otherwise we would have I[ρ, X] → I[ρ′, X′] ≠ I[ρ, X], since correlations could be changed under the influence of the new variables x′ ∈ X′. Thus, in order to maintain the global correlations, the function Φ must be independent of the coordinates. To constrain the form of Φ further, we can again appeal to split coordinate invariance, now with arbitrary Jacobian γ(x) ≠ 1, which causes Φ to transform accordingly. But this must hold for arbitrary split invariant coordinate transformations, even when the Jacobian factor γ(x) ≠ 1. Hence, the function Φ must also be independent of its second and third arguments. We then have that the split coordinate invariance suggested by the PCC, together with DC1, gives (47). This is similar to the steps found in the relative entropy derivations [8,11], but differs from the steps in [12,13].

I_min-Design Goal and PCC
Split coordinate invariance, as realized in Equation (47), provides an even stronger restriction on I[ρ, X], which we can find by appealing to a special case. Since all distributions with the same correlations should have the same value of I[ρ, X] by the Design Goal and the PCC, all independent joint distributions ϕ will also have the same value, which by design is the unique minimum, I[ϕ, X] = I_min. Requiring independent joint distributions ϕ to return a unique minimum is similar to imposing a positivity condition on I[ρ, X] [39]. We will find, however, that positivity only arises once DC2 has been taken into account. Here, I_min could be any value, so long as when one introduces correlations, ϕ ∗→ ρ, the value of I[ρ, X] always increases from I_min. This condition could also be imposed as a general convexity property of I[ρ, X]; however, this is already required by the design goal and does not require an additional axiom.
Inserting (48) into (47), we find an expression that must be independent of the underlying distribution p(x) = ∏_{i=1}^N p(x_i), since all independent distributions, regardless of the joint space X, must give the same value I_min. Thus we conclude that the density m(x) must be the product marginal, m(x) = ∏_{i=1}^N p(x_i), so that this independence is guaranteed. Thus, by design, expression (47) becomes (25).

Subsystem Independence-DC2
In the following subsections we will consider two approaches for imposing subsystem independence via the PCC and DC2. Both lead to identical functional expressions for I[ρ, X]. The analytic approach assumes that the functional form of Φ may be expressed as a Taylor series; the algebraic approach reaches the same conclusion without this assumption.

Analytical Approach
Let us assume that the function Φ is analytic, so that it can be Taylor expanded. Since the argument p(x)/∏_{i=1}^N p(x_i) is defined over [0, ∞), we can consider the expansion over some open subset of [0, ∞) about any particular value p_0(x)/∏_{i=1}^N p_0(x_i), where the Φ_n are real coefficients. For p(x)/∏_{i=1}^N p(x_i) in the neighborhood of p_0(x)/∏_{i=1}^N p_0(x_i), the series (53) converges to Φ(p(x)/∏_{i=1}^N p(x_i)). We Taylor expand Φ(p(x)/∏_{i=1}^N p(x_i)) about the point at which the propositions are nearly independent, i.e., p(x) ≈ ∏_{i=1}^N p(x_i), where the upper index (n) denotes the nth derivative. The 0th term equals Φ_min by definition of the design goal, which leaves the remainder Φ_+, where the + in Φ_+ refers to the terms with n > 0.
Consider the independent-subsystem special case in which p(x) is factorizable into p(x) = p(x_1, x_2)p(x_3, x_4) for all x ∈ X. We can represent Φ_+ with an analogous two-dimensional Taylor expansion in p(x_1, x_2) and p(x_3, x_4), with mixed derivative terms Φ^(n_1,n_2). Since transformations of one independent subsystem, ρ_12 ∗→ ρ′_12 or ρ_34 ∗→ ρ′_34, must leave the other invariant by the PCC and subsystem independence, DC2 requires that the mixed derivatives be set to zero, Φ^(n_1,n_2) = 0. This gives a functional equation for Φ_+, where Φ_+^1 corresponds to the terms involving X_1 and X_2, and Φ_+^2 corresponds to the terms involving X_3 and X_4. Including Φ_min = Φ(1) from the n_1 = 0 and n_2 = 0 cases, we have the total expression. To determine the solution of this equation we appeal to the special case in which both subsystems are themselves independent, p(x_1, x_2) = p(x_1)p(x_2) and p(x_3, x_4) = p(x_3)p(x_4). This means that either Φ_min = 0 or Φ_min = ±∞; however, the latter two solutions are ruled out by the design goal: setting the minimum to +∞ makes no sense, and setting it to −∞ does not allow for ranking, as it implies Φ = −∞ for all finite values of Φ_+, which would violate the Design Goal. Further, Φ_min = −∞ would imply that the minimum is not a well-defined constant, which violates (51). Thus, by eliminative induction and following our design method, it follows that Φ_min = Φ(1) must equal 0.
The general equation for Φ with two independent subsystems ρ = ρ_12 ρ_34 can then be written with its arguments explicit. If subsystem ρ_34 = ρ_3 ρ_4 is itself independent, a further relation follows, but due to commutativity the same relation holds with the subsystems exchanged. This implies that the functional form of Φ does not depend on the particular subsystem, Φ = Φ_+^1 = Φ_+^2 in general. This gives the following functional equation for Φ: Φ(ab) = Φ(a) + Φ(b). The solution to this functional equation is the logarithm, Φ(z) = A log z, where A is an arbitrary constant. Setting A = 1, so that the global correlation functional increases with the amount of correlations, yields (28).
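The resulting additivity can be checked numerically. The sketch below is our own illustration (the small TC routine is not from the paper): when the joint distribution factors into two independent blocks, ρ = ρ_12 ρ_34, the total correlation of the whole equals the sum of the blocks' total correlations.

```python
import numpy as np

def total_correlation(p):
    """TC of a joint pmf stored as an N-dimensional array."""
    p = np.asarray(p, dtype=float)
    n = p.ndim
    prod = np.ones_like(p)
    for axis in range(n):
        marg = p.sum(axis=tuple(a for a in range(n) if a != axis))
        shape = [1] * n
        shape[axis] = p.shape[axis]
        prod = prod * marg.reshape(shape)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / prod[mask])))

rng = np.random.default_rng(0)
rho12 = rng.random((2, 3)); rho12 /= rho12.sum()   # correlated block (X1, X2)
rho34 = rng.random((3, 2)); rho34 /= rho34.sum()   # correlated block (X3, X4)
# Independent subsystems: the four-variable joint is the outer product.
rho = np.einsum('ab,cd->abcd', rho12, rho34)
lhs = total_correlation(rho)
rhs = total_correlation(rho12) + total_correlation(rho34)
```

I[ρ, X] = I[ρ_12, X_12] + I[ρ_34, X_34] holds to numerical precision, so transforming one block changes only that block's contribution.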
The result in (66) could be imposed as an additivity condition on the functional, I[ρ, X] = I[ρ_12, X_12] + I[ρ_34, X_34] [39]. In general, however, the correlation functional I[ρ, X] obeys the stricter strong additivity condition, which we have no a priori reason to impose. Here, the strong additivity condition is instead an end result, realized as a consequence of imposing the PCC through the various design criteria.

Algebraic Approach
Here we present an alternative algebraic approach to imposing DC2. Consider the case in which subsystem two is independent, ρ_34 → ϕ_34 = p(x_3, x_4) = p(x_3)p(x_4) (the notation ϕ in place of the usual ρ for a distribution is meant to represent independence among its subsystems), and ρ = ρ_12 ϕ_34. This special case holds for all product forms of ϕ_34 that have no correlations and for all possible transformations ρ_12 ∗→ ρ′_12. Alternatively, we could have considered the situation in which subsystem one is independent, ρ = ϕ_12 ρ_34. Analogously, this case holds for all product forms of ϕ_12 that have no correlations and for all possible transformations ρ_34 ∗→ ρ′_34. The consequence of these considerations is that, in principle, we have isolated the amount of correlations in either subsystem. Imposing DC2 requires that the amount of correlations in either subsystem cannot be affected by changes in correlations in the other, and this must hold for general independent ρ = ρ_1 ρ_2. Consider a variation of ρ_1 where ρ_2 is held fixed, which induces a change in I[ρ_1 ρ_2, X]. Now consider a variation of ρ_1 at any other value of the second subsystem. It follows from DC2 that transformations in one independent subsystem should not change the amount of correlations in another independent subsystem, due to the PCC. However, for the same δρ_1, the current functional form (72) allows δI[ρ_1 ρ_2, X]/δρ_1 at one value of ρ_2 to differ from δI[ρ_1 ρ_2, X]/δρ_1 at another, which would imply that the amount of correlations induced by the change δρ_1 depends on the value of ρ_2. Imposing DC2 therefore enforces that the amount of change in the correlations is the same for any value of ρ_2, i.e., that the variations must be independent too. The same holds for variations with respect to ρ_2 with ρ_1 kept fixed, which implies that (71) must be linear. The general solution to this differential equation is linear in the subsystem functionals, with constants a, b, c, which we now seek.
Applying this form to N independent subsystems relates the constant c to I_min. Because a, c, and I_min are all constants, they should not depend on the number of independent subsystems N. Thus, for another distribution ρ′ which contains M ≠ N independent subsystems, the same relation would imply N = M, which cannot be realized by assumption. This implies that the only solution is I_min = c = 0, in agreement with the analytic approach. One then uses (69) and finds a = 1. This gives a functional equation for Φ; at this point the solution follows from Equation (67), so that I[ρ] is (28).

The n-Partite Special Cases
In the previous sections of the article, we designed an expression that quantifies the global correlations present within an entire probability distribution and found this quantity to be identical to the total correlation (TC). Now we would like to discuss partial cases in which one does not consider the information shared by the entire set of variables X, but only the information shared across particular subsets of variables in X. These special cases of TC measure the n-partite correlations present in a given distribution ρ. We call such functionals n-partite informations, or NPI.
Given a set of N variables in proposition space, X = X_1 × ··· × X_N, an n-partite subset of X consists of n ≤ N subspaces {X^(k)}_n ≡ {X^(1), ..., X^(k), ..., X^(n)} which are collectively exhaustive and mutually exclusive, i.e., ⋃_{k=1}^n X^(k) = X and X^(k) ∩ X^(j) = ∅ for k ≠ j. The special case of (83) for any n-partite splitting will be called the n-partite information and will be denoted by I[ρ, X^(1); . . .; X^(n)], with (n − 1) semicolons separating the partitions. The largest number n that one can form for any variable set X is simply the number of variables present in X, and for this largest set the n-partite information coincides with the total correlation. Each of the n-partite informations can be derived in a manner similar to the total correlation, except that the density m(x) in step (52) is replaced with the appropriate independent density associated to the n-partite system, i.e., m(x) = ∏_{k=1}^n p(x^(k)). Thus, the split invariant coordinate transformation (10) becomes one in which each of the partitions in variable space gives an overall block-diagonal Jacobian. (In the simplest case, for N dimensions and n = 2 partitions, the Jacobian matrix is block diagonal in the partitions, J(X^(1), X^(2)) = J(X^(1)) ⊕ J(X^(2)), which we use to define the split invariant coordinate transformations in the bipartite (or mutual) information case.) We then derive what we call the n-partite information (NPI), I[ρ, X^(1); . . .; X^(n)]. The number of possible partitions of the space into n ≤ N splits is given by the Stirling numbers of the second kind [41]. A Stirling number of the second kind, S(N, n), gives the number of ways to partition a set of N elements into n subsets.
The definition in terms of binomial coefficients is S(N, n) = (1/n!) ∑_{j=0}^n (−1)^j C(n, j)(n − j)^N. Thus, the number of unique n-partite informations one can form from a set of N variables is equal to S(N, n).
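The counting is easy to make concrete. A small helper of our own implementing the binomial-coefficient formula above:

```python
from math import comb, factorial

def stirling2(N, n):
    """Stirling number of the second kind: the number of ways to partition
    a set of N variables into n non-empty blocks."""
    total = sum((-1) ** j * comb(n, j) * (n - j) ** N for j in range(n + 1))
    return total // factorial(n)  # the alternating sum is always divisible by n!

# For N = 4 variables there are S(4, 2) = 7 distinct bipartite splits,
# and a single N-partite split that recovers the total correlation.
```

The n = N and n = 1 cases each count exactly one partition, matching the fact that the finest split reproduces the TC and the trivial split carries no correlation content.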

Remarks on the Upper-Bound of TC
The TC provides an upper bound for any choice of n-partite information; i.e., any n-partite information with n < N necessarily satisfies I[ρ, X^(1); . . .; X^(n)] ≤ I[ρ, X]. This can be shown by using the decomposition of the TC into continuous Shannon entropies discussed in [6], where the continuous Shannon entropy is S[ρ, X] = −∫ dx p(x) log p(x). (While it is true that the continuous Shannon entropy is not coordinate invariant, the particular combinations used in this paper are, due to the TC and n-partite information being relative entropies themselves.) Likewise, for any n-partition we have the analogous decomposition. Since we in general have the subadditivity inequality for entropy [5], it follows that for any k-th partition of a set of N variables, the entropy of the partition is bounded by the sum of the entropies of its exhaustive internal variables. Using (96) in (94), we then have that any n-partite information is bounded above by the TC. Thus, the total correlation is always greater than or equal to the correlations between any n-partite splitting of X. Upper bounds for the discrete case were discussed in [42].
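The bound can be checked numerically. The sketch below is our own illustration (`npi` is a hypothetical helper): it computes an n-partite information as the relative entropy between the joint pmf and the product of its block marginals, and verifies that a bipartite split of three variables never exceeds the TC.

```python
import numpy as np

def npi(p, partition):
    """n-partite information: KL(p || product of block marginals), where
    `partition` lists disjoint sets of axes covering all axes of p."""
    p = np.asarray(p, dtype=float)
    n = p.ndim
    prod = np.ones_like(p)
    for block in partition:
        # Marginal over the axes outside this block, broadcast back.
        marg = p.sum(axis=tuple(a for a in range(n) if a not in block))
        shape = [d if a in block else 1 for a, d in enumerate(p.shape)]
        prod = prod * marg.reshape(shape)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / prod[mask])))

rng = np.random.default_rng(3)
p = rng.random((2, 3, 2)); p /= p.sum()
tc = npi(p, [{0}, {1}, {2}])        # finest split: the total correlation
bipartite = npi(p, [{0}, {1, 2}])   # one of the S(3, 2) = 3 bipartite splits
```

Every coarser split discards some internal correlations, so the bipartite value is bounded above by the TC, as the inequality states.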

The Bipartite (Mutual) Information
Perhaps the most studied special case of the NPI is the mutual information (MI), which is the smallest possible n-partition one can form (n = 2). As was discussed in the introduction, it is useful in inference tasks and was the first quantity of the general class of n-partite informations to be defined and exploited [14].
To analyze the mutual information, consider first relabeling the total space as Z = X_1 × ··· × X_N to match the common notation in the MI literature. The bipartite information considers only two subspaces, X ⊂ Z and Y ⊂ Z, rather than all of them. These two subspaces define a bipartite split in the proposition space such that X ∩ Y = ∅ and X × Y = Z. This turns the product marginal into p(x)p(y), where x ∈ X and y ∈ Y. Finally, we arrive at the functional that we label by its split, I[ρ, X; Y] = ∫ dx dy p(x, y) log [p(x, y)/(p(x)p(y))], which is the mutual information. Since the marginal space is split into two distinct subspaces, the mutual information quantifies only the correlations between the two subspaces, and not between all the variables, as the total correlation does for a given split. Whenever the total space Z is two-dimensional, the total correlation and the mutual information coincide.
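A minimal sketch of the bipartite case (our own code; the names are illustrative):

```python
import numpy as np

def mutual_information(pxy):
    """Bipartite information of a 2-D joint pmf: KL(p(x, y) || p(x)p(y))."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal over Y
    py = pxy.sum(axis=0, keepdims=True)   # marginal over X
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log((pxy / (px * py))[mask])))

# For a two-variable space, MI and the total correlation coincide.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
mi = mutual_information(joint)
```

The MI is symmetric in its two subspaces and vanishes exactly when the joint factorizes into p(x)p(y).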
One can derive the mutual information by using the same steps as in the total correlation derivation above, except replacing the independence condition in (48) with the bipartite marginal in (98). The goal is the same as in Section 3, except that the MI ranks the distributions ρ according to the correlations between two subspaces of propositions, rather than within the entire proposition space.

The Discrete Total Correlation
One may derive a discrete total correlation and discrete NPI by starting from Equation (36) and following the same arguments without taking the continuous limit after DC1 is imposed. The inferential transformations explored in Section 2.1 are somewhat different for discrete distributions. Coordinate transformations are replaced by general reparameterizations (an example of such a discrete reparameterization, or discrete coordinate transformation, is intuitively used in coin-flipping experiments: the outcomes of coin flips may be parameterized with (−1, 1) or equally with (0, 1) to represent tails versus heads, respectively), in which one defines a bijection between sets, f : X → X′. As with general coordinate transformations (3), if f(x_i) is a bijection then we equate the probabilities, P′(f(x_i)) = P(x_i). Since x_i is simply a label for a proposition, we enforce that the probabilities associated to it are independent of the choice of label. One can define discrete split coordinate invariant transformations analogous to the continuous case. Using discrete split invariant coordinate transformations, the index i is removed from F_i above, which is analogous to the removal of the x-coordinate dependence in the continuous case. The functional equation for the logarithm is found and solved analogously by imposing DC2; the discrete TC is then found, and the discrete NPI may be argued similarly. The other transformations in Section 2.1 remain the same, except for replacing integrals by sums; in the above subsections, the continuous relative entropy is replaced by its discrete version. The discrete MI is extremely useful for dealing with problems in communication theory, such as noisy-channel communication and rate-distortion theory [5]. It is also reasonable to consider situations with combinations of discrete and continuous variables; one example is the binary category case [34].
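Label invariance is easy to verify for the discrete TC: permuting the outcome labels of each variable (a bijection f : X → X′) leaves the value unchanged. A small sketch with a TC routine of our own:

```python
import numpy as np

def discrete_tc(p):
    """Discrete total correlation of a 2-D joint pmf."""
    p = np.asarray(p, dtype=float)
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p / (px * py))[mask])))

rng = np.random.default_rng(1)
p = rng.random((3, 4)); p /= p.sum()
# Relabel the outcomes of each variable independently (a discrete
# "coordinate transformation"): permute rows and columns.
p_relabeled = p[np.ix_(rng.permutation(3), rng.permutation(4))]
```

Since x_i is only a label, P′(f(x_i)) = P(x_i), and the discrete TC is invariant under any such relabeling.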

Sufficiency
There is a large literature on the topic of sufficiency [5,30,43], which dates back to work originally done by Fisher [44]. Some have argued that the idea dates back even to Laplace [23], a hundred years before Fisher. What both were ultimately trying to determine was whether one could find simpler statistics that contain all of the information required to make equal inferences about some parameter.
Let p(x, θ) = p(x)p(θ|x) = p(θ)p(x|θ) be a joint distribution over some variables X and some parameters Θ which we wish to infer. Consider then a function y = f(x), and also the joint density p(x, y, θ) = p(x, y)p(θ|x, y) = p(x)p(θ|x, y)δ(y − f(x)). If y is a sufficient statistic for x with respect to θ, then the conditional probability satisfies p(θ|x, y) = p(θ|y); it does not depend on x because y is sufficient. Fisher's factorization theorem states that a sufficient statistic for θ gives the relation p(x|θ) = f(y|θ)g(x), where f and g are functions that are not necessarily probabilities; i.e., they are not normalized with respect to their arguments. However, since the left-hand side is certainly normalized with respect to x, the right-hand side must be as well.
We can then identify g(x) = p(x), which depends only on x, and f(y|θ) = p(y|θ)/p(y), which is a ratio of two probabilities and hence not normalized with respect to y.
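For intuition, the defining property p(θ|x, y) = p(θ|y) can be checked on a toy model of our own construction: for two i.i.d. coin flips with bias θ, the number of heads y = x₁ + x₂ is sufficient, so any two outcomes sharing the same y yield identical posteriors over a discretized θ.

```python
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)        # discretized parameter grid
prior = np.full_like(thetas, 1.0 / len(thetas))

def posterior(flips):
    """Posterior over theta given a tuple of 0/1 coin-flip outcomes."""
    like = np.ones_like(thetas)
    for x in flips:
        like = like * (thetas if x == 1 else 1.0 - thetas)
    post = prior * like
    return post / post.sum()

# (1, 0) and (0, 1) share y = 1, so their posteriors coincide exactly,
# while (1, 1) has y = 2 and a different posterior.
```

The grid, uniform prior, and flip count here are illustrative choices, not part of the paper's development.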

A New Definition of Sufficiency
While the notion of a sufficient statistic is useful, how can we quantify the sufficiency of a statistic which is not completely sufficient, but only partially so? What about the case of generic n-partite systems? The n-partite information can provide an answer. We begin with the bipartite case.
Our design derivation shows that the MI is uniquely designed for quantifying the global correlations in the bipartite case. Because the correlations between two variables indicate how informative one variable is toward the inference of the other, a change in the MI indicates a change in one's ability to make such inferences over the entire spaces of both variables. Thus, we can use this interpretation to quantify statistical sufficiency in terms of the change in the amount of correlations in a global sense.
Consider an arbitrary continuous function f : X → X′, which we call a statistic of X. We define the sufficiency of the statistic f(X) with respect to another space Θ for a bipartite system as simply the ratio of mutual informations, suff_Θ(f) = I[ρ′, f(X); Θ]/I[ρ, X; Θ], which is always bounded by 0 ≤ suff_Θ(f) ≤ 1 due to the data processing inequality in Appendix C.3; here ρ′ is the distribution defined over the joint space f(X) × Θ. In this problem space, it is assumed that there exists correlation between (X, Θ), i.e., I[ρ, X; Θ] > 0, at least before the statistic is checked for sufficiency. Statistics for which suff_Θ[f(X)] = 1 are called sufficient and correspond to the definition given by Fisher. We can see this by appealing to the special case p(x, f(x), θ) = p(x)p(θ|x)δ(y − f(x)) for some statistic y = f(x). It is true that I[ρ′, f(X); Θ] = ∫ dy dθ p(y)p(θ|y) log [p(θ|y)/p(θ)], so that when p(θ|y) = p(θ|x), we have I[ρ′, f(X); Θ] = I[ρ, X; Θ], which is the criterion for y to be a sufficient statistic. With this definition of sufficiency (110), we have a way of evaluating maps f(X) which attempt to preserve correlations between X and Θ. These procedures are ubiquitous in machine learning [34], manifold learning and other inference tasks, although the explicit connection with statistical sufficiency has yet to be realized.

The n-Partite Sufficiency
Like the mutual information, the n-partite information can provide insight into sufficiency. Let us first begin by stating a theorem.
The above theorem provides a ranking of transformation functions (f_{x^(1)}, f_{x^(2)}, . . ., f_{x^(n)}) of the n partitions, I[ρ, X^(1); . . .; X^(n)] ≥ I[ρ′, f(X^(1)); . . .; X^(n)] ≥ I[ρ′′, f(X^(1)); f(X^(2)); . . .; X^(n)] ≥ · · · . The action of any set of functions {f} can only ever decrease the amount of n-partite correlations, much as in the data processing inequality. We define the sufficiency of a set of functions {f} analogously to (110). Consider that we generate a set of functions for m of the partitions, leaving n − m partitions alone. Then the sufficiency of the set of functions {f} which act on the subspaces {X^(k)}_m, with respect to the remaining n − m partitions, is given by suff(f) = I[ρ′, f(X^(1)); . . .; f(X^(m)); X^(m+1); . . .; X^(n)] / I[ρ, X^(1); . . .; X^(n)], where X̄ = {X^(k)}_n \ {X^(k)}_m denotes the untouched partitions. Like the bipartite sufficiency, the n-partite sufficiency is bounded between zero and one due to the n-partite information inequality.

The n-Partite Joint Processing Inequality
Using the results from Equations (97) and (113), we can express a general result which we call the n-partite joint processing inequality. While the n-partite inequality concerns functions which act individually within the n-partite spaces, we can generalize this notion to functions which act over the entire variable space, i.e., functions which jointly process partitions.
There are many possible combinations of maps of the form (130) and (117) that may analogously be expressed with a continuous notion of sufficiency; however, for the application of successive functions, one will always find a nested set of inequalities of the form (129) and (125).

The Likelihood Ratio
Here we will associate the invariance of MI with the invariance of type I and type II errors. Consider a binary decision problem in which we have some set of discriminating variables X, following a mixture of two distributions (e.g., signal and background) labeled by a parameter θ = {s, b}. The inference problem can then be cast in terms of the joint distribution p(x, θ) = p(x)p(θ|x). According to the Neyman-Pearson lemma [24], the likelihood ratio (132) gives a sufficient statistic for the significance level (133), where b = H_0 is typically associated with the null hypothesis. This means that the likelihood ratio (132) will allow us to determine whether the data X satisfies the significance level in (133). Given Bayes' theorem, the likelihood ratio is equivalent to the posterior ratio, which is just as good a statistic as the likelihood ratio, since p(b)/p(s) is a constant for all x ∈ X. If we then construct a sufficient statistic y = f(x) for X, such that (135) holds, then the posterior ratios, and hence the likelihood ratios, are equivalent; hence the significance levels are also invariant, and therefore the type I and type II errors will also be invariant. Thus we can use the MI as a tool for finding the type I and type II errors of some unknown probability distribution by first constructing a sufficient statistic using some technique (typically an ML technique), and then finding the type I and type II errors on the simpler distribution. Then, due to (136), the errors associated with the simpler distribution will be equivalent to the errors of the original unknown one. Apart from its invariance, we can also show another consequence of MI under arbitrary transformations f(X) for binary decision problems. Imagine that we successfully construct a sufficient statistic for X. Then the likelihood ratios Φ(x) and Φ(f(x)) will be equivalent for all x ∈ X. Consider that we adjust the probability of one value of p(θ|f(x_i)) by shifting the relative weight of signal and background
for that particular value f(x_i), where δp is some small change, so that

Π′(f(x_i)) = p′(s|f(x_i))/p′(b|f(x_i)) = (p(s|f(x_i)) + δp)/(p(b|f(x_i)) − δp),

which is not equal to the value given by the sufficient statistic. Whether Π′(f(x_i)) is larger or smaller than Π(f(x_i)), in either case the number of type I or type II errors will increase for the distribution in which the sufficient value p(θ|f(x_i)) is replaced by p′(θ|f(x_i)). Therefore, for any distribution given by the joint space X × Θ, the MI determines the type I and type II errors for any statistic on the data X.
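The invariance of the error rates under a sufficient statistic can be checked in a small numerical sketch. All numbers below (the likelihoods p(x|θ), the prior, and the merging map f) are assumed toy values chosen so that f merges outcomes with equal likelihood ratios, which is the sufficiency condition discussed above:

```python
# Toy check (assumed numbers) that a sufficient statistic y = f(x)
# preserves the likelihood ratio Phi and hence type I / type II errors.
p_x_s = {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4}   # p(x|s)
p_x_b = {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.2}   # p(x|b), with b = H_0

f = {0: 0, 1: 0, 2: 1, 3: 1}               # merges equal-ratio outcomes

def phi_x(x):                               # likelihood ratio Phi(x)
    return p_x_s[x] / p_x_b[x]

# Push-forward distributions p(y|theta) under y = f(x)
p_y_s = {0: p_x_s[0] + p_x_s[1], 1: p_x_s[2] + p_x_s[3]}
p_y_b = {0: p_x_b[0] + p_x_b[1], 1: p_x_b[2] + p_x_b[3]}

def phi_y(y):
    return p_y_s[y] / p_y_b[y]

# Phi(x) == Phi(f(x)) for every x, i.e., f is sufficient
ratios_match = all(abs(phi_x(x) - phi_y(f[x])) < 1e-12 for x in p_x_s)

# For the test "reject H_0 when Phi > 1": type I (alpha), type II (beta)
alpha_x = sum(p_x_b[x] for x in p_x_b if phi_x(x) > 1)
alpha_y = sum(p_y_b[y] for y in p_y_b if phi_y(y) > 1)
beta_x = sum(p_x_s[x] for x in p_x_s if phi_x(x) <= 1)
beta_y = sum(p_y_s[y] for y in p_y_s if phi_y(y) <= 1)
```

Both error rates agree exactly between the original space X and the compressed space f(X), as the argument above predicts.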

Discussion of Alternative Measures of Correlation
The design derivation in this paper puts the various NPI functionals on the same foundational footing as the relative entropy. This raises the question of whether other, similar information-theoretic quantities can be designed along the same lines. Some quantities that come to mind are the α-mutual information [45], multivariate mutual information [46,47], directed information [48], transfer entropy [49] and causation entropy [50][51][52].
The α-mutual information [45] belongs to the family of functionals known as the Rényi entropies [53], together with their close cousin, the Tsallis entropy [54,55]. Tsallis proposed his entropy functional as an attempted generalization for applications in statistical mechanics; however, the probability distributions it produces can be generated from the standard MaxEnt procedure [10] and do not require a new thermodynamics (for some discussion of this topic, see [10], page 114). Likewise, Rényi's family of entropies attempts to generalize the relative entropy for generic inference tasks, which inadvertently relaxes some of the design criteria concerning independence. Essentially, Rényi introduces a set of parameterized entropies S_η[p, q], with parameter η, which weakens the independent-subsystem additivity criterion. Imposing that these functionals obey subsystem independence immediately constrains η = 0 or η = −1 and reduces them back to the standard relative entropy, i.e., S_{η=0}[p, q] ≡ S[p, q] and S_{η=−1}[p, q] ≡ S[q, p]. Without a strict understanding of what it means for subsystems to be independent, one cannot conduct reasonable science. Thus, such "generalized" measures of correlation (such as the α-mutual information [45]) which abandon subsystem independence cannot be trusted.
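The weakening of strict subsystem additivity can be made concrete with the Tsallis entropy, which for independent distributions satisfies the pseudo-additivity identity S_q(p ⊗ r) = S_q(p) + S_q(r) + (1 − q)S_q(p)S_q(r), so that strict additivity fails whenever q ≠ 1. The distributions and the value q = 2 below are assumed purely for illustration:

```python
# Numerical check of Tsallis pseudo-additivity for independent subsystems:
# S_q(p ⊗ r) = S_q(p) + S_q(r) + (1 - q) S_q(p) S_q(r)
def tsallis(p, q):
    # Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

q = 2.0
p = [0.5, 0.5]
r = [0.25, 0.75]
joint = [pi * ri for pi in p for ri in r]   # independent joint distribution

lhs = tsallis(joint, q)
rhs = tsallis(p, q) + tsallis(r, q) + (1 - q) * tsallis(p, q) * tsallis(r, q)
additivity_gap = tsallis(p, q) + tsallis(r, q) - lhs   # nonzero for q != 1
```

The nonzero gap shows that even statistically independent subsystems fail to be additive under such parameterized entropies, which is precisely the design criterion the derivation in this paper refuses to relax.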
Multivariate mutual information (MMI) is an attempt to generalize the standard MI to the case where several sets of propositions are compared with one another; however, this differs from the total correlation which we designed in this paper. For example, given three sets of propositions X, Y and Z, the multivariate mutual information (not to be confused with the multi-information [56], which is another name for the total correlation) is

MMI[ρ, X; Y; Z] def= I[ρ, X; Y] − I[ρ, X; Y|Z].

One difficulty with this expression is that it can be negative, as was shown by Hu [47]. Thus, defining a minimum MMI is not possible in general, which suggests that a design derivation of MMI requires a different interpretation. Despite these difficulties, there have been several recent successes in the study of MMI, including Bell [57] and Baudot et al. [16,17], who studied MMI in the context of algebraic topology.
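Hu's negativity result can be verified with the standard XOR construction (assumed here as an illustrative example): for independent fair bits X and Y with Z = X ⊕ Y, one finds I[X; Y] = 0 but I[X; Y|Z] = 1 bit, so the MMI equals −1 bit:

```python
import math

# XOR example: X, Y independent fair bits, Z = X xor Y.
# MMI[X;Y;Z] = I(X;Y) - I(X;Y|Z) = 0 - 1 = -1 bit.
joint = {}
for x in (0, 1):
    for y in (0, 1):
        joint[(x, y, x ^ y)] = 0.25

def marginal(dist, keep):
    # Marginalize a joint dict over the kept index positions.
    out = {}
    for k, v in dist.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

def mi_xy(dist):
    # I(X;Y) in bits for a joint over pairs (x, y)
    px = marginal(dist, (0,))
    py = marginal(dist, (1,))
    return sum(v * math.log2(v / (px[(k[0],)] * py[(k[1],)]))
               for k, v in dist.items() if v > 0)

i_xy = mi_xy(marginal(joint, (0, 1)))

# I(X;Y|Z) = sum_z p(z) I(X;Y | Z=z)
p_z = marginal(joint, (2,))
i_xy_given_z = 0.0
for (z,), pz in p_z.items():
    cond = {(k[0], k[1]): v / pz for k, v in joint.items() if k[2] == z}
    i_xy_given_z += pz * mi_xy(cond)

mmi = i_xy - i_xy_given_z   # negative, unlike the total correlation
```

Since the total correlation designed in this paper is nonnegative by construction, this sign behavior alone shows MMI cannot be a ranking functional of the same kind.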
Another extension of the mutual information is the transfer entropy, which was first introduced by Schreiber [49] and is a special case of directed information [48]. The transfer entropy is a conditional mutual information which attempts to quantify the amount of "information" that flows between two time-dependent random processes. Given a set of propositions which are dynamical, such that X = X(t) and Y = Y(t), so that at time t_i the propositions take the form X(t_i) def= X_{t_i} and Y(t_i) def= Y_{t_i}, the transfer entropy (TE) between X(t) and Y(t) at time t_i is defined as the conditional mutual information

TE[X → Y]_{t_i} def= I[ρ, X_{t_{j<i}}; Y_{t_i} | Y_{t_{j<i}}],

where the notation t_{j<i} refers to all times t_j before t_i. Thus, the TE is meant to quantify the influence of the variable X on predicting the state Y_{t_i} when one already knows the history of Y, i.e., it quantifies the amount of independent correlations provided by X. Given that the TE is a conditional mutual information, it does not require a design derivation independent of the MI; it can be justified on the basis of the discussion around (A32). Likewise, the more general directed information is also a conditional MI and hence can be justified in the same way.
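As a minimal sketch of this definition with one-step histories (the copy process below is an assumed toy example, not taken from the text), consider i.i.d. fair bits X_t with Y_t = X_{t−1}: the past of X fully determines Y_t, so TE(X → Y) is exactly 1 bit, while TE(Y → X) vanishes:

```python
import math

# Exact transfer entropy with one-step histories:
# TE(X -> Y) = I(Y_t ; X_{t-1} | Y_{t-1}).
# Toy process: X_t i.i.d. fair bits, Y_t = X_{t-1} (pure copy).

def marg(dist, keep):
    # Marginalize a joint dict over the kept index positions.
    out = {}
    for k, v in dist.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

def cmi(dist, a, b, c):
    # I(A;B|C) in bits, with a, b, c index positions into the key tuples.
    p_abc = marg(dist, (a, b, c))
    p_c = marg(dist, (c,))
    p_ac = marg(dist, (a, c))
    p_bc = marg(dist, (b, c))
    return sum(v * math.log2(v * p_c[(k[2],)] /
                             (p_ac[(k[0], k[2])] * p_bc[(k[1], k[2])]))
               for k, v in p_abc.items() if v > 0)

# Joint over (x_prev, y_prev, y_now): y_now = x_prev; y_prev independent.
copy_xy = {(x, yp, x): 0.25 for x in (0, 1) for yp in (0, 1)}
te_xy = cmi(copy_xy, a=0, b=2, c=1)

# Joint over (y_prev, x_prev, x_now): three independent fair bits, since
# Y carries no information about X's future.
copy_yx = {(yp, xp, xn): 0.125 for yp in (0, 1)
           for xp in (0, 1) for xn in (0, 1)}
te_yx = cmi(copy_yx, a=0, b=2, c=1)
```

The asymmetry te_xy = 1 bit versus te_yx = 0 is what makes TE directional, in contrast to the symmetric MI.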
Finally, the causation entropy [50][51][52] can also be expressed as a conditional mutual information. Causation entropy (CE) attempts to quantify time-dependent correlations between nodes in a connected graph and hence generalizes the notion of transfer entropy. Given a set of nodes X, Y and Z, the causation entropy between two subsets conditioned on a third is given by

CE[X → Y | Z] def= I[ρ, X_{t_{j<i}}; Y_{t_i} | Z_{t_{j<i}}].

The above definition reduces to the transfer entropy whenever the set of variables Z = Y. As was shown by Sun et al. [58], the causation entropy allows one to more appropriately quantify the causal relationships within connected graphs, unlike the transfer entropy, which is somewhat limited. Since CE is a conditional mutual information, it does not require an independent design derivation. As with transfer entropy and directed information, the interpretation of CE can be justified on the basis of (A32).
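The advantage of conditioning on more than Y's own history can be illustrated with an assumed common-driver toy model: a persistent hidden bit Z (which stays put with probability 0.9) drives both chains via x_t = z_t and y_t = z_{t−1}, with no direct X → Y link. Transfer entropy then reports a spuriously positive flow, while causation entropy conditioned on Z's history as well correctly vanishes:

```python
import math

def marg(dist, keep):
    # Marginalize a joint dict over the kept index positions.
    out = {}
    for k, v in dist.items():
        kk = tuple(k[i] for i in keep)
        out[kk] = out.get(kk, 0.0) + v
    return out

def cmi(dist, a, b, cond):
    # I(A;B|C) in bits; a, b are single index positions, cond a tuple.
    p_abc = marg(dist, (a, b) + cond)
    p_c = marg(dist, cond)
    p_ac = marg(dist, (a,) + cond)
    p_bc = marg(dist, (b,) + cond)
    total = 0.0
    for k, v in p_abc.items():
        if v > 0:
            kc = k[2:]
            total += v * math.log2(v * p_c[kc] /
                                   (p_ac[(k[0],) + kc] * p_bc[(k[1],) + kc]))
    return total

stay = 0.9                      # persistence of the hidden driver Z
joint = {}
for z2 in (0, 1):               # z_{t-2}; stationary distribution uniform
    for z1 in (0, 1):           # z_{t-1}
        p = 0.5 * (stay if z1 == z2 else 1.0 - stay)
        # key = (x_prev, y_prev, z_prev, y_now) = (z1, z2, z1, z1)
        joint[(z1, z2, z1, z1)] = joint.get((z1, z2, z1, z1), 0.0) + p

te = cmi(joint, a=0, b=3, cond=(1,))     # TE(X -> Y): spuriously positive
ce = cmi(joint, a=0, b=3, cond=(1, 2))   # CE(X -> Y | {Y, Z}): zero
```

This is the sense in which CE "more appropriately" quantifies causal structure: conditioning out the driver's history removes the correlation that TE mistakes for influence.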

Conclusions
Using a design derivation, we showed that the TC is the functional designed to rank the global amount of correlations in a system between all of its variables. We relied heavily on the PCC, which, while quite simple, is restrictive enough to isolate the functional form of I[ρ, X] using eliminative induction. We enforced the PCC using two different methods as an additional measure of rigor (analytically, through Taylor expansion, and algebraically, through the functional Equation (71)). The fact that both approaches lead to the same functional shows that the design criteria are highly constraining, and the total correlation is the unique solution. We generalized our solution to the n-partite information; this global correlation quantifier can express the TC and MI as special cases.
Using our design derivation we were able to quantify the amount of global correlations and analyze the effect of inferential transformations in a new light. Because the correlations between variables indicate how informative a set of variables is toward the inference of the others, a change in the global amount indicates a change in one's ability to make such inferences globally. Thus, we can use the NPI to quantify statistical sufficiency in terms of the change in the amount of correlations over the entire joint variable spaces. This leads to a rigorous quantification of continuous statistical sufficiency, which attains its upper bound of one when the Fisher sufficiency condition is satisfied.
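This continuous notion of sufficiency can be sketched numerically as the ratio of the MI retained under a variable mapping, s(f) = I[f(X); Θ]/I[X; Θ], which equals one for a sufficient map and drops below one for a lossy one. The likelihoods and maps below are assumed toy values for illustration:

```python
import math

# Continuous sufficiency ratio s(f) = I[f(X); Theta] / I[X; Theta]
# for an assumed toy binary-parameter model.
p_theta = {'s': 0.5, 'b': 0.5}
p_x_given = {'s': {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4},
             'b': {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.2}}

def mutual_info(map_f):
    # I[f(X); Theta] in bits for the pushed-forward joint p(f(x), theta)
    joint, p_y = {}, {}
    for t, pt in p_theta.items():
        for x, px in p_x_given[t].items():
            y = map_f(x)
            joint[(y, t)] = joint.get((y, t), 0.0) + pt * px
            p_y[y] = p_y.get(y, 0.0) + pt * px
    return sum(v * math.log2(v / (p_y[y] * p_theta[t]))
               for (y, t), v in joint.items() if v > 0)

i_full = mutual_info(lambda x: x)        # I[X; Theta]
i_suff = mutual_info(lambda x: x // 2)   # merges equal-likelihood-ratio x
i_lossy = mutual_info(lambda x: 0)       # discards all of X

s_suff = i_suff / i_full     # sufficient map: ratio 1
s_lossy = i_lossy / i_full   # maximally lossy map: ratio 0
```

The ratio thus expresses, as a percentage, how much of the correlation between X and Θ survives a given transformation, which is precisely the non-binary notion of sufficiency proposed above.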
To impose DC1, consider that the new information requires us to change the distribution in one subdomain D ⊂ X, ρ → ρ′ = ρ + δρ_D, in a way that may change the correlations, while leaving the probabilities in the complement domain fixed, δρ_D̄ = 0. (The subdomain D and its complement D̄ obey the relations D ∩ D̄ = ∅ and D ∪ D̄ = X.) Let the subset of the propositions in D be relabeled as {x_1, . . ., x_d, . . ., x_|D|} ⊆ {x_1, . . ., x_i, . . ., x_|X|}. Then consider the variations in I[ρ, X] with respect to the changes of ρ in the subdomain D.

The multivariate mutual information is defined as MMI[ρ, X; Y; Z] def= I[ρ, X; Y] − I[ρ, X; Y|Z].

To see this, we can rewrite Equation (105) in terms of the distributions p(x, y, θ) = p(x, θ)p(y|x, θ) = p(θ)p(x|θ)δ(y − f(x)).