1. Introduction and Statement of the Result
The concept of entropy was introduced in thermodynamics and statistical mechanics as a measure of uncertainty or disorganization in a physical system [1,2]. In 1877, L. Boltzmann [2] gave the probabilistic interpretation of entropy and found the famous formula
$$S = k \log W.$$
Roughly speaking, entropy is the logarithm of the number of ways in which the physical system can be configured. The second law of thermodynamics says that the entropy of a closed system cannot decrease.
To reveal the physics of information, C. Shannon [3] introduced entropy into communication theory. Let X be a discrete random element with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$; the entropy of X (or $p$) is defined by
$$H(X) = H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x). \quad (1)$$
Clearly, $H(X)$, which is called the Shannon entropy, takes its minimum 0 when X is degenerate and takes its maximum $\log |\mathcal{X}|$ when X is uniformly distributed. In this sense, entropy is a measure of the uncertainty of a random element.
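As a quick numerical illustration (a minimal Python sketch of ours, not part of the original text; the function name `shannon_entropy` is an assumption), the two extreme cases above can be checked directly:

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability atoms
    return -np.sum(p * np.log2(p))

print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0  (degenerate X)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  (uniform: log2 4)
```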
In the theory of information, the definition of entropy is extended to a pair of random variables as follows. Let $(X, Y)$ be a two-dimensional random vector in $\mathcal{X} \times \mathcal{Y}$ with a joint distribution $P = \{p(x, y)\}$; the joint entropy of $(X, Y)$ (or P) is defined by
$$H(X, Y) = H(P) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \quad (2)$$
Another important concept for $(X, Y)$ is the mutual information, which is a measure of the amount of information that one random variable contains about the other. It is defined by
$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{\mu(x)\nu(y)}, \quad (3)$$
where $\mu$ and $\nu$ are the marginal distributions of X and Y. By definition, one has
$$I(X; Y) = H(X) + H(Y) - H(X, Y). \quad (4)$$
Note that in some settings, the maximum of the mutual information (over the input distributions of a channel) is called the channel capacity, which plays a key role in information theory through Shannon's famous second theorem: the Channel Coding Theorem [3].
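The identity (4) gives a direct way to compute I(X; Y) from a joint probability matrix; below is a small Python sketch of ours (the function names are assumptions):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(P):
    """I(X;Y) = H(mu) + H(nu) - H(P), following identity (4)."""
    P = np.asarray(P, dtype=float)
    mu, nu = P.sum(axis=1), P.sum(axis=0)   # marginals of X and Y
    return entropy(mu) + entropy(nu) - entropy(P)

# Perfectly correlated fair bits: I(X;Y) = 1 bit.
P = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(P))  # 1.0
```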
For basic concepts and properties in information theory, readers may refer to [4] and the references therein. For entropy and its developments in various topics of pure mathematics, readers may consult the series of talks given by Xiang-Dong Li [5].
By (4), for given marginals $\mu$ and $\nu$, maximizing $I(X; Y)$ and minimizing $H(X, Y)$ are two sides of the same coin. Inferring an unknown joint distribution of two random variables with given marginals is an old problem in the area of probabilistic inference. As far as we know, the problem goes back at least to Fréchet [6] and Hoeffding [7,8], who studied the question of identifying the extremal joint distributions that maximize (or minimize, respectively) the correlation of the two variables. For more studies in this area and more applications in pure and applied sciences, readers may refer to [9,10,11,12,13,14], etc.
In this paper, we consider the following setting of the problem described above. For simplicity, suppose $\mathcal{X} = \mathcal{Y} = [n] := \{1, 2, \ldots, n\}$, and let $\mu$ and $\nu$ be two discrete probability distributions on $[n]$. Let X and Y be random variables in $[n]$ with distributions $\mu$ and $\nu$, respectively; then, we seek a minimum-entropy two-dimensional random vector $(X, Y)$ in $[n]^2$ with the marginals $\mu$ and $\nu$.
One strategy for solving the problem mentioned above is to calculate the exact value of the minimum entropy over all couplings of $\mu$ and $\nu$. Our efforts in this direction will be reported in a follow-up paper [15]: building on the local optimization lemmas given in Section 2, we obtain an algorithm that calculates the exact value of the minimal joint entropy for any given marginals $\mu$ and $\nu$; the computational complexity of the algorithm is analyzed there as well.
Another strategy for studying the problem is to seek the special structure of a minimum-entropy coupling $(X, Y)$. Clearly, in the case where X and Y are independent, the joint entropy takes its maximum $H(X) + H(Y)$. That is to say, the independent structure (perhaps the most disordered one) of $(X, Y)$ determines the maximum entropy; but what special structure in a coupling determines the minimum entropy of a two-dimensional random system? The main goal of the present paper is to establish such a structure.
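The contrast can be seen on a toy example (our own sketch, in Python): both couplings below have uniform marginals on a two-point set, but the independent one attains the maximal joint entropy $H(X) + H(Y)$, while the diagonal one is far more "ordered":

```python
import numpy as np

def entropy(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mu = nu = np.array([0.5, 0.5])

P_indep = np.outer(mu, nu)       # independent coupling of (mu, nu)
P_diag  = np.diag(mu)            # X = Y almost surely; same marginals

print(entropy(P_indep))  # 2.0 = H(mu) + H(nu), the maximum
print(entropy(P_diag))   # 1.0, the minimum for these marginals
```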
Denote by $\mathcal{P}([n])$ the set of all discrete probability distributions on $[n]$. For each $\mu \in \mathcal{P}([n])$, let $F_\mu$ be the cumulative distribution function, defined by
$$F_\mu(k) = \sum_{i=1}^{k} \mu(i), \quad k \in [n]. \quad (5)$$
Recall that a permutation $\sigma$ is a bijective map from $[n]$ onto itself. For any $\mu \in \mathcal{P}([n])$, define $\mu^{\sigma}(i) = \mu(\sigma(i))$, $i \in [n]$. From Equation (1), one has
$$H(\mu^{\sigma}) = H(\mu), \quad (6)$$
which holds for any permutation $\sigma$. For a random variable X with distribution $\mu$, the random variable $\sigma(X)$ has the distribution $\mu^{\sigma^{-1}}$, where $\sigma^{-1}$ is the inverse of $\sigma$.
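In code (a sketch of ours), the invariance (6) is just the fact that reindexing a probability vector does not change its entropy:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mu = np.array([0.1, 0.2, 0.3, 0.4])
sigma = [2, 0, 3, 1]                   # a permutation of {0, 1, 2, 3}

mu_sigma = mu[sigma]                   # mu^sigma(i) = mu(sigma(i))
print(np.isclose(entropy(mu), entropy(mu_sigma)))  # True
```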
For any $\mu, \nu \in \mathcal{P}([n])$, denote by $\mathcal{C}(\mu, \nu)$ the set of all joint distributions P with the marginals $\mu$ and $\nu$. For any $P \in \mathcal{C}(\mu, \nu)$, suppose $(X, Y)$ is distributed according to P. For any permutation pair $(\sigma, \tau)$, denote by $P^{(\sigma, \tau)}$ the joint distribution of $(\sigma(X), \tau(Y))$. Then,
$$H(P^{(\sigma, \tau)}) = H(P), \qquad P^{(\sigma, \tau)} \in \mathcal{C}(\mu^{\sigma^{-1}}, \nu^{\tau^{-1}}). \quad (7)$$
For any $k \neq l$ in $[n]$, denote by $E_{kl}$ the matrix such that $E_{kl}(k, l) = 1$; $E_{kl}(l, k) = 1$; $E_{kl}(i, i) = 1$ for $i \neq k, l$; and $E_{kl}(i, j) = 0$ otherwise. For any n-th order probability matrix P, the matrix $E_{kl}P$ (resp. $PE_{kl}$) is obtained from P by exchanging the positions of its k-th and l-th rows (resp. columns). Furthermore, for any $P \in \mathcal{C}(\mu, \nu)$ and any finite sequences $(k_1, l_1), \ldots, (k_s, l_s)$ and $(k'_1, l'_1), \ldots, (k'_t, l'_t)$, there exists a unique permutation pair $(\sigma, \tau)$, such that
$$E_{k_s l_s} \cdots E_{k_1 l_1} \, P \, E_{k'_1 l'_1} \cdots E_{k'_t l'_t} = P^{(\sigma, \tau)}. \quad (8)$$
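A sketch of ours showing the exchange matrices $E_{kl}$ and their action by left and right multiplication (0-based indices in the code, versus 1-based in the text):

```python
import numpy as np

def E(n, k, l):
    """The exchange matrix E_{kl}: the identity with rows k and l swapped."""
    M = np.eye(n)
    M[[k, l]] = M[[l, k]]
    return M

# Some 3rd-order probability matrix with marginals (0.2,0.3,0.5), (0.4,0.4,0.2).
P = np.outer([0.2, 0.3, 0.5], [0.4, 0.4, 0.2])
print(E(3, 0, 2) @ P)   # P with its 1st and 3rd rows exchanged
print(P @ E(3, 0, 2))   # P with its 1st and 3rd columns exchanged
```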
Definition 1. For any $P \in \mathcal{C}(\mu, \nu)$, suppose $(X, Y)$ is distributed according to P. We note the following:
- 1. $(X, Y)$, or P, is order-preserving, if $\Pr\{X \le Y\} = 1$;
- 2. $(X, Y)$, or P, is essentially order-preserving, if for some permutation pair $(\sigma, \tau)$, $(\sigma(X), \tau(Y))$ is order-preserving.

$(X, Y)$ being order-preserving is also called “X being stochastically dominated by Y” in probability theory. Clearly, it holds if and only if $P(i, j) = 0$ for any $i > j$, i.e., the joint distribution matrix is upper triangular. For an essentially order-preserving $(X, Y)$, with the permutation pair $(\sigma, \tau)$ given in Definition 1, $P^{(\sigma, \tau)}$ is upper triangular.
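Order-preservation is easy to test at the matrix level; a small Python check of ours:

```python
import numpy as np

def is_order_preserving(P, tol=1e-12):
    """Pr(X <= Y) = 1 iff every entry strictly below the diagonal is 0."""
    return np.all(np.tril(np.asarray(P, dtype=float), k=-1) < tol)

P = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.2],
              [0.0, 0.0, 0.2]])
print(is_order_preserving(P))  # True: the matrix is upper triangular
```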
For any $\mu, \nu \in \mathcal{P}([n])$, by Strassen's Theorem [16] on stochastic domination, there exists a coupling $P \in \mathcal{C}(\mu, \nu)$ such that P is upper triangular if and only if $F_\nu(k) \le F_\mu(k)$ for all $k \in [n]$, where $F_\mu$ and $F_\nu$ are defined in (5). Note that in the literature on information theory (see [11,17], etc.), people usually say that “$\nu$ is majorized by $\mu$” when $F_\nu(k) \le F_\mu(k)$ for all $k \in [n]$.
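The domination condition is a pointwise comparison of the two CDFs from (5); in Python (a sketch of ours, with the function name assumed):

```python
import numpy as np

def strassen_condition(mu, nu, tol=1e-12):
    """True iff F_nu(k) <= F_mu(k) for every k, i.e., an upper
    triangular coupling of (mu, nu) exists by Strassen's Theorem."""
    return np.all(np.cumsum(nu) <= np.cumsum(mu) + tol)

mu = np.array([0.5, 0.3, 0.2])   # X concentrated on small values
nu = np.array([0.2, 0.3, 0.5])   # Y concentrated on large values
print(strassen_condition(mu, nu))  # True
print(strassen_condition(nu, mu))  # False
```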
Now, we turn to the following optimization problem:
$$\min_{P \in \mathcal{C}(\mu, \nu)} H(P). \quad (9)$$
Note that $\mathcal{C}(\mu, \nu)$ forms a compact subset of $\mathbb{R}^{n \times n}$; the existence of a minimizer $P^*$ follows from the continuity of the entropy function H. Note that in this paper, we also call $P^*$ the minimum-entropy coupling.
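Problem (9) minimizes a concave function over the transportation polytope $\mathcal{C}(\mu, \nu)$. The exact algorithm is deferred to [15]; as a hedged illustration only, here is a well-known greedy heuristic (our sketch, not the algorithm of [15]) that produces some low-entropy coupling, though not necessarily the minimizer of (9):

```python
import numpy as np

def greedy_coupling(mu, nu):
    """Greedy heuristic: repeatedly place mass min(mu_i, nu_j) at the
    largest remaining row and column masses.  Returns a valid element
    of C(mu, nu); it need not solve (9) exactly."""
    mu, nu = np.array(mu, dtype=float), np.array(nu, dtype=float)
    P = np.zeros((mu.size, nu.size))
    while True:
        i, j = np.argmax(mu), np.argmax(nu)
        m = min(mu[i], nu[j])
        if m <= 0:
            break                    # all mass has been placed
        P[i, j] += m
        mu[i] -= m
        nu[j] -= m
    return P

P = greedy_coupling([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(P.sum(axis=1), P.sum(axis=0))  # the marginals are recovered
```

Each pass zeroes out at least one remaining row or column mass, so the loop terminates after at most $2n$ steps.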
What key structural characteristics of a coupling $(X, Y)$ govern the minimum entropy of the two-dimensional stochastic system, as formulated in the optimization problem (9)? Specifically, what mathematical properties must the joint distribution of $(X, Y)$ satisfy to achieve the global minimum of the joint entropy? Is there an intrinsic connection between such optimal coupling structures and the order-preserving properties of the variables, as suggested by entropy minimization principles? We state our main result as follows.
Theorem 1. Suppose $\mu \in \mathcal{P}([n])$ and $\nu \in \mathcal{P}([n])$. If $P^*$ solves the optimization problem (9), and $(X^*, Y^*)$ is distributed according to $P^*$, then $(X^*, Y^*)$, or $P^*$, is essentially order-preserving.

Theorem 1 shows that the coupling with the minimal entropy must be essentially order-preserving, whereas the coupling with the maximal entropy aligns with independence. This means that the minimum-entropy coupling in a two-dimensional discrete system must be an upper triangular discrete joint distribution, which can be formed by exchanging the rows and columns of the joint distribution matrix based on Theorem 3. Consequently, entropy is interpreted as a measure of system disorder.
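For small n, essential order-preservation can be verified by brute force over permutation pairs, directly following Definition 1 (our sketch; the cost is $O((n!)^2)$, so this is for illustration only):

```python
import numpy as np
from itertools import permutations

def is_essentially_order_preserving(P, tol=1e-12):
    """Search for a permutation pair (sigma, tau) making P^(sigma,tau)
    upper triangular, as in Definition 1."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    for sigma in permutations(range(n)):
        Q = P[list(sigma), :]                    # permute the rows
        for tau in permutations(range(n)):
            if np.all(np.tril(Q[:, list(tau)], k=-1) < tol):
                return True
    return False

# A lower triangular matrix is essentially order-preserving:
P = np.array([[0.3, 0.0],
              [0.3, 0.4]])
print(is_essentially_order_preserving(P))  # True (swap rows and columns)
```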
Another structure of the minimum-entropy coupling is discovered in the follow-up paper [15]: a tree structure in graph theory. Based on this structure, we develop an algorithm to obtain the exact value of the minimal entropy.
The rest of the paper is arranged as follows. In Section 2.1, we first develop the local optimization lemmas, Lemma 2 and Lemma 3. Then, for any $P \in \mathcal{C}(\mu, \nu)$, by using the local optimization lemmas, we construct a coupling $P^* \in \mathcal{C}(\mu, \nu)$ such that $H(P^*) \le H(P)$. In Section 2.2, by Lemma 2, developed in Section 2.1, we prove Theorem 1. In Section 3, for given $\mu$ and $\nu$, the independent coupling P of $(\mu, \nu)$ is given as an example for Theorem 3, and we optimize P to an upper triangular coupling.
3. An Example
Theorem 1 shows that the coupling with the minimal entropy must be essentially order-preserving. How can one construct such entropy-minimizing order-preserving coupling structures? In this section, we offer a computational approach to an order-preserving coupling as a practical illustration. Let us consider the following example: let $\mu$ and $\nu$ be two distributions on $[6]$, and let P be the independent coupling of $(\mu, \nu)$, i.e., $P(i, j) = \mu(i)\nu(j)$.
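The independent coupling is simply the outer product of the marginals; the sketch below uses hypothetical marginals on $[6]$ (our own values, not the paper's, which appear in the displayed matrices):

```python
import numpy as np

# Hypothetical marginals on {1,...,6}; the example's actual values are
# those displayed in the paper, not these.
mu = np.array([0.10, 0.15, 0.20, 0.25, 0.20, 0.10])
nu = np.array([0.05, 0.15, 0.30, 0.20, 0.20, 0.10])

P = np.outer(mu, nu)   # independent coupling: P(i, j) = mu(i) * nu(j)
assert np.allclose(P.sum(axis=1), mu) and np.allclose(P.sum(axis=0), nu)
```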
Now, we begin to optimize P to an upper triangular coupling, as follows.
First, since the largest entry of P is $P(4, 3)$, we exchange the positions of the 4th and 6th rows and then the positions of the 3rd and 6th columns of P, moving this entry to position $(6, 6)$; denote the resulting matrix by A.
By using Lemma 2 on suitable submatrices, we renew and optimize A; then, using Lemma 2 on a further submatrix, we optimize A once more.
In this situation, the largest entry of the current matrix, together with the associated row and column masses, satisfies the conditions of Lemma 4.
Second, by Lemma 4, the matrix is optimized further.
Noticing the position of the largest entry of the resulting matrix, we continue the optimization in the same manner.
Finally, by similar procedures, we optimize the matrix through several further steps.
Clearly, the resulting matrix is upper triangular, with marginals that are permutations of $\mu$ and $\nu$. By Lemma 2, it can be further optimized to an upper triangular matrix of smaller entropy: using Lemma 2 on a suitable submatrix and then exchanging the positions of its 2nd and 4th rows, we obtain the final matrix. Note that this final matrix is upper triangular, its marginals are permutations of $\mu$ and $\nu$, and its entropy does not exceed $H(P)$.
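As a final sanity check, one can verify after each optimization step above that the marginals are preserved up to permutation (row and column exchanges merely permute them) and that the entropy never increases; a sketch of ours:

```python
import numpy as np

def entropy(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def check_step(P, Q, tol=1e-9):
    """Marginals must match up to permutation; entropy must not increase."""
    same_mu = np.allclose(np.sort(P.sum(axis=1)), np.sort(Q.sum(axis=1)), atol=tol)
    same_nu = np.allclose(np.sort(P.sum(axis=0)), np.sort(Q.sum(axis=0)), atol=tol)
    return same_mu and same_nu and entropy(Q) <= entropy(P) + tol
```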