1. Introduction and Statement of the Result
The concept of entropy was introduced in thermodynamics and statistical mechanics as a measure of uncertainty or disorganization in a physical system [1,2]. In 1877, L. Boltzmann [2] gave the probabilistic interpretation of entropy and found the famous formula
$$S = k \log W.$$
Roughly speaking, entropy is the logarithm of the number of ways in which the physical system can be configured. The second law of thermodynamics says that the entropy of a closed system cannot decrease.
To reveal the physics of information, C. Shannon [3] introduced entropy into communication theory. Let X be a discrete random element with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$; the entropy of X (or $p$) is defined by
$$H(X) = H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x). \quad (1)$$
Clearly, $H(X)$, which is called the Shannon entropy, takes its minimum 0 when X is degenerate and takes its maximum $\log |\mathcal{X}|$ when X is uniformly distributed. In this sense, entropy is a measure of the uncertainty of a random element.
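As a quick numerical illustration (a minimal Python sketch of ours, not part of the original text; the function name `shannon_entropy` is an assumption), the two extreme cases above can be checked directly:

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), with the convention 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability atoms
    return -np.sum(p * np.log2(p))

print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0  (degenerate X)
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0  (uniform: log2 4)
```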
In the theory of information, the definition of entropy is extended to a pair of random variables as follows. Let $(X, Y)$ be a two-dimensional random vector in $\mathcal{X} \times \mathcal{Y}$ with a joint distribution $P = \{p(x, y)\}$; the joint entropy of $(X, Y)$ (or P) is defined by
$$H(X, Y) = H(P) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y). \quad (2)$$
Another important concept for $(X, Y)$ is the mutual information, which is a measure of the amount of information that one random variable contains about the other. It is defined by
$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{\mu(x)\nu(y)}, \quad (3)$$
where $\mu$ and $\nu$ are the marginal distributions of X and Y. By definition, one has
$$I(X; Y) = H(X) + H(Y) - H(X, Y). \quad (4)$$
Note that in some settings, the maximum of the mutual information (over the input distributions of a channel) is called the channel capacity, which plays a key role in information theory through Shannon's famous second theorem: the Channel Coding Theorem [3].
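The identity (4) gives a direct way to compute I(X; Y) from a joint probability matrix; below is a small Python sketch of ours (the function names are assumptions):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(P):
    """I(X;Y) = H(mu) + H(nu) - H(P), following identity (4)."""
    P = np.asarray(P, dtype=float)
    mu, nu = P.sum(axis=1), P.sum(axis=0)   # marginals of X and Y
    return entropy(mu) + entropy(nu) - entropy(P)

# Perfectly correlated fair bits: I(X;Y) = 1 bit.
P = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(P))  # 1.0
```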
For basic concepts and properties in information theory, readers may refer to [4] and the references therein. For entropy and its developments in various topics of pure mathematics, readers may consult the series of talks given by Xiang-Dong Li [5].
By (4), for given marginals $\mu$ and $\nu$, maximizing $I(X; Y)$ and minimizing $H(X, Y)$ are two sides of the same coin. Inferring an unknown joint distribution of two random variables with given marginals is an old problem in the area of probabilistic inference. As far as we know, the problem goes back at least to Fréchet [6] and Hoeffding [7,8], who studied the question of identifying the extremal joint distributions that maximize (or minimize, respectively) the correlation of the two variables. For more studies in this area and more applications in pure and applied sciences, readers may refer to [9,10,11,12,13,14], etc.
In this paper, we consider the following setting of the problem described above. For simplicity, suppose $\mathcal{X} = \mathcal{Y} = [n] := \{1, 2, \ldots, n\}$, and let $\mu$ and $\nu$ be two discrete probability distributions on $[n]$. Let X and Y be random variables in $[n]$ with distributions $\mu$ and $\nu$, respectively; then, we seek a minimum-entropy two-dimensional random vector $(X, Y)$ in $[n]^2$ with the marginals $\mu$ and $\nu$.
One strategy for solving the problem mentioned above is to calculate the exact value of the minimum entropy over all couplings of $\mu$ and $\nu$. Our efforts in this direction will be reported in a follow-up paper [15]: building on the local optimization lemmas given in Section 2, we obtain an algorithm that calculates the exact value of the minimal joint entropy for any given marginals $\mu$ and $\nu$; the computational complexity of the algorithm is analyzed there as well.
Another strategy for studying the problem is to seek the special structure of a minimum-entropy coupling $(X, Y)$. Clearly, in the case where X and Y are independent, the joint entropy takes its maximum $H(X) + H(Y)$. That is to say, the independent structure (perhaps the most disordered one) of $(X, Y)$ determines the maximum entropy; but what special structure in a coupling determines the minimum entropy of a two-dimensional random system? The main goal of the present paper is to establish such a structure.
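The contrast can be seen on a toy example (our own sketch, in Python): both couplings below have uniform marginals on a two-point set, but the independent one attains the maximal joint entropy $H(X) + H(Y)$, while the diagonal one is far more "ordered":

```python
import numpy as np

def entropy(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mu = nu = np.array([0.5, 0.5])

P_indep = np.outer(mu, nu)       # independent coupling of (mu, nu)
P_diag  = np.diag(mu)            # X = Y almost surely; same marginals

print(entropy(P_indep))  # 2.0 = H(mu) + H(nu), the maximum
print(entropy(P_diag))   # 1.0, the minimum for these marginals
```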
Denote by $\mathcal{P}([n])$ the set of all discrete probability distributions on $[n]$. For each $\mu \in \mathcal{P}([n])$, let $F_\mu$ be the cumulative distribution function, defined by
$$F_\mu(k) = \sum_{i=1}^{k} \mu(i), \quad k \in [n]. \quad (5)$$
Recall that a permutation $\sigma$ is a bijective map from $[n]$ onto itself. For any $\mu \in \mathcal{P}([n])$, define $\mu^{\sigma}(i) = \mu(\sigma(i))$, $i \in [n]$. From Equation (1), one has
$$H(\mu^{\sigma}) = H(\mu), \quad (6)$$
which holds for any permutation $\sigma$. For a random variable X with distribution $\mu$, the random variable $\sigma(X)$ has the distribution $\mu^{\sigma^{-1}}$, where $\sigma^{-1}$ is the inverse of $\sigma$.
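In code (a sketch of ours), the invariance (6) is just the fact that reindexing a probability vector does not change its entropy:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mu = np.array([0.1, 0.2, 0.3, 0.4])
sigma = [2, 0, 3, 1]                   # a permutation of {0, 1, 2, 3}

mu_sigma = mu[sigma]                   # mu^sigma(i) = mu(sigma(i))
print(np.isclose(entropy(mu), entropy(mu_sigma)))  # True
```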
For any $\mu, \nu \in \mathcal{P}([n])$, denote by $\mathcal{C}(\mu, \nu)$ the set of all joint distributions P with the marginals $\mu$ and $\nu$. For any $P \in \mathcal{C}(\mu, \nu)$, suppose $(X, Y)$ is distributed according to P. For any permutation pair $(\sigma, \tau)$, denote by $P^{(\sigma, \tau)}$ the joint distribution of $(\sigma(X), \tau(Y))$. Then,
$$H(P^{(\sigma, \tau)}) = H(P), \qquad P^{(\sigma, \tau)} \in \mathcal{C}(\mu^{\sigma^{-1}}, \nu^{\tau^{-1}}). \quad (7)$$
For any $k \neq l$ in $[n]$, denote by $E_{kl}$ the matrix such that $E_{kl}(k, l) = 1$; $E_{kl}(l, k) = 1$; $E_{kl}(i, i) = 1$ for $i \neq k, l$; and $E_{kl}(i, j) = 0$ otherwise. For any n-th order probability matrix P, the matrix $E_{kl}P$ (resp. $PE_{kl}$) is obtained from P by exchanging the positions of its k-th and l-th rows (resp. columns). Furthermore, for any $P \in \mathcal{C}(\mu, \nu)$ and any finite sequences $(k_1, l_1), \ldots, (k_s, l_s)$ and $(k'_1, l'_1), \ldots, (k'_t, l'_t)$, there exists a unique permutation pair $(\sigma, \tau)$, such that
$$E_{k_s l_s} \cdots E_{k_1 l_1} \, P \, E_{k'_1 l'_1} \cdots E_{k'_t l'_t} = P^{(\sigma, \tau)}. \quad (8)$$
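A sketch of ours showing the exchange matrices $E_{kl}$ and their action by left and right multiplication (0-based indices in the code, versus 1-based in the text):

```python
import numpy as np

def E(n, k, l):
    """The exchange matrix E_{kl}: the identity with rows k and l swapped."""
    M = np.eye(n)
    M[[k, l]] = M[[l, k]]
    return M

# Some 3rd-order probability matrix with marginals (0.2,0.3,0.5), (0.4,0.4,0.2).
P = np.outer([0.2, 0.3, 0.5], [0.4, 0.4, 0.2])
print(E(3, 0, 2) @ P)   # P with its 1st and 3rd rows exchanged
print(P @ E(3, 0, 2))   # P with its 1st and 3rd columns exchanged
```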
Definition 1. For any $P \in \mathcal{C}(\mu, \nu)$, suppose $(X, Y)$ is distributed according to P. We note the following:
- 1. $(X, Y)$, or P, is order-preserving, if $\Pr\{X \le Y\} = 1$;
- 2. $(X, Y)$, or P, is essentially order-preserving, if for some permutation pair $(\sigma, \tau)$, $(\sigma(X), \tau(Y))$ is order-preserving.

$(X, Y)$ being order-preserving is also called “X being stochastically dominated by Y” in probability theory. Clearly, it holds if and only if $P(i, j) = 0$ for any $i > j$, i.e., the joint distribution matrix is upper triangular. For an essentially order-preserving $(X, Y)$, with the permutation pair $(\sigma, \tau)$ given in Definition 1, $P^{(\sigma, \tau)}$ is upper triangular.
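Order-preservation is easy to test at the matrix level; a small Python check of ours:

```python
import numpy as np

def is_order_preserving(P, tol=1e-12):
    """Pr(X <= Y) = 1 iff every entry strictly below the diagonal is 0."""
    return np.all(np.tril(np.asarray(P, dtype=float), k=-1) < tol)

P = np.array([[0.2, 0.1, 0.0],
              [0.0, 0.3, 0.2],
              [0.0, 0.0, 0.2]])
print(is_order_preserving(P))  # True: the matrix is upper triangular
```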
For any $\mu, \nu \in \mathcal{P}([n])$, by Strassen's Theorem [16] on stochastic domination, there exists a coupling $P \in \mathcal{C}(\mu, \nu)$ such that P is upper triangular if and only if $F_\nu(k) \le F_\mu(k)$ for all $k \in [n]$, where $F_\mu$ and $F_\nu$ are defined in (5). Note that in the literature on information theory (see [11,17], etc.), people usually say that “$\nu$ is majorized by $\mu$” when $F_\nu(k) \le F_\mu(k)$ for all $k \in [n]$.
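The domination condition is a pointwise comparison of the two CDFs from (5); in Python (a sketch of ours, with the function name assumed):

```python
import numpy as np

def strassen_condition(mu, nu, tol=1e-12):
    """True iff F_nu(k) <= F_mu(k) for every k, i.e., an upper
    triangular coupling of (mu, nu) exists by Strassen's Theorem."""
    return np.all(np.cumsum(nu) <= np.cumsum(mu) + tol)

mu = np.array([0.5, 0.3, 0.2])   # X concentrated on small values
nu = np.array([0.2, 0.3, 0.5])   # Y concentrated on large values
print(strassen_condition(mu, nu))  # True
print(strassen_condition(nu, mu))  # False
```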
Now, we turn to the following optimization problem:
$$\min_{P \in \mathcal{C}(\mu, \nu)} H(P). \quad (9)$$
Note that $\mathcal{C}(\mu, \nu)$ forms a compact subset of $\mathbb{R}^{n \times n}$; the existence of a minimizer $P^*$ follows from the continuity of the entropy function H. Note that in this paper, we also call $P^*$ the minimum-entropy coupling.
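Problem (9) minimizes a concave function over the transportation polytope $\mathcal{C}(\mu, \nu)$. The exact algorithm is deferred to [15]; as a hedged illustration only, here is a well-known greedy heuristic (our sketch, not the algorithm of [15]) that produces some low-entropy coupling, though not necessarily the minimizer of (9):

```python
import numpy as np

def greedy_coupling(mu, nu):
    """Greedy heuristic: repeatedly place mass min(mu_i, nu_j) at the
    largest remaining row and column masses.  Returns a valid element
    of C(mu, nu); it need not solve (9) exactly."""
    mu, nu = np.array(mu, dtype=float), np.array(nu, dtype=float)
    P = np.zeros((mu.size, nu.size))
    while True:
        i, j = np.argmax(mu), np.argmax(nu)
        m = min(mu[i], nu[j])
        if m <= 0:
            break                    # all mass has been placed
        P[i, j] += m
        mu[i] -= m
        nu[j] -= m
    return P

P = greedy_coupling([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
print(P.sum(axis=1), P.sum(axis=0))  # the marginals are recovered
```

Each pass zeroes out at least one remaining row or column mass, so the loop terminates after at most $2n$ steps.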
What key structural characteristics of a coupling $(X, Y)$ govern the minimum entropy of the two-dimensional stochastic system, as formulated in the optimization problem (9)? Specifically, what mathematical properties must the joint distribution of $(X, Y)$ satisfy to achieve the global minimum of the joint entropy? Is there an intrinsic connection between such optimal coupling structures and the order-preserving properties of the variables, as suggested by entropy minimization principles? We state our main result as follows.
Theorem 1. Suppose $\mu \in \mathcal{P}([n])$ and $\nu \in \mathcal{P}([n])$. If $P^*$ solves the optimization problem (9), and $(X^*, Y^*)$ is distributed according to $P^*$, then $(X^*, Y^*)$, or $P^*$, is essentially order-preserving.

Theorem 1 shows that the coupling with the minimal entropy must be essentially order-preserving, whereas the coupling with the maximal entropy aligns with independence. This means that the minimum-entropy coupling in a two-dimensional discrete system must be an upper triangular discrete joint distribution, which can be formed by exchanging the rows and columns of the joint distribution matrix based on Theorem 3. Consequently, entropy is interpreted as a measure of system disorder.
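For small n, essential order-preservation can be verified by brute force over permutation pairs, directly following Definition 1 (our sketch; the cost is $O((n!)^2)$, so this is for illustration only):

```python
import numpy as np
from itertools import permutations

def is_essentially_order_preserving(P, tol=1e-12):
    """Search for a permutation pair (sigma, tau) making P^(sigma,tau)
    upper triangular, as in Definition 1."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    for sigma in permutations(range(n)):
        Q = P[list(sigma), :]                    # permute the rows
        for tau in permutations(range(n)):
            if np.all(np.tril(Q[:, list(tau)], k=-1) < tol):
                return True
    return False

# A lower triangular matrix is essentially order-preserving:
P = np.array([[0.3, 0.0],
              [0.3, 0.4]])
print(is_essentially_order_preserving(P))  # True (swap rows and columns)
```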
Another structure of the minimum-entropy coupling is discovered in the follow-up paper [15]: a tree structure in graph theory. Based on this structure, we develop an algorithm to obtain the exact value of the minimal entropy.
The rest of the paper is arranged as follows. In Section 2.1, we first develop the local optimization lemmas, Lemma 2 and Lemma 3. Then, for any $P \in \mathcal{C}(\mu, \nu)$, by using the local optimization lemmas, we construct a coupling $P^* \in \mathcal{C}(\mu, \nu)$ such that $H(P^*) \le H(P)$. In Section 2.2, by Lemma 2, developed in Section 2.1, we prove Theorem 1. In Section 3, for given $\mu$ and $\nu$, the independent coupling P of $(\mu, \nu)$ is given as an example for Theorem 3, and we optimize P to an upper triangular coupling.
3. An Example
Theorem 1 shows that the coupling with the minimal entropy must be essentially order-preserving. How can one construct such entropy-minimizing order-preserving coupling structures? In this section, we offer a computational approach to an order-preserving coupling as a practical illustration. Let us consider the following example: let $\mu$ and $\nu$ be two distributions on $[6]$, and let P be the independent coupling of $(\mu, \nu)$, i.e., $P(i, j) = \mu(i)\nu(j)$.
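The independent coupling is simply the outer product of the marginals; the sketch below uses hypothetical marginals on $[6]$ (our own values, not the paper's, which appear in the displayed matrices):

```python
import numpy as np

# Hypothetical marginals on {1,...,6}; the example's actual values are
# those displayed in the paper, not these.
mu = np.array([0.10, 0.15, 0.20, 0.25, 0.20, 0.10])
nu = np.array([0.05, 0.15, 0.30, 0.20, 0.20, 0.10])

P = np.outer(mu, nu)   # independent coupling: P(i, j) = mu(i) * nu(j)
assert np.allclose(P.sum(axis=1), mu) and np.allclose(P.sum(axis=0), nu)
```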
Now, we begin to optimize P to an upper triangular coupling, as follows.
First, since the largest entry of P is $P(4, 3)$, we exchange the positions of the 4th and 6th rows and then the positions of the 3rd and 6th columns of P, moving this entry to position $(6, 6)$; denote the resulting matrix by A.
By using Lemma 2 on suitable submatrices, we renew and optimize A; then, using Lemma 2 on a further submatrix, we optimize A once more.
In this situation, the largest entry of the current matrix, together with the associated row and column masses, satisfies the conditions of Lemma 4.
Second, by Lemma 4, the matrix is optimized further.
Noticing the position of the largest entry of the resulting matrix, we continue the optimization in the same manner.
Finally, by similar procedures, we optimize the matrix through several further steps.
Clearly, the resulting matrix is upper triangular, with marginals that are permutations of $\mu$ and $\nu$. By Lemma 2, it can be further optimized to an upper triangular matrix of smaller entropy: using Lemma 2 on a suitable submatrix and then exchanging the positions of its 2nd and 4th rows, we obtain the final matrix. Note that this final matrix is upper triangular, its marginals are permutations of $\mu$ and $\nu$, and its entropy does not exceed $H(P)$.
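As a final sanity check, one can verify after each optimization step above that the marginals are preserved up to permutation (row and column exchanges merely permute them) and that the entropy never increases; a sketch of ours:

```python
import numpy as np

def entropy(P):
    p = np.asarray(P, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def check_step(P, Q, tol=1e-9):
    """Marginals must match up to permutation; entropy must not increase."""
    same_mu = np.allclose(np.sort(P.sum(axis=1)), np.sort(Q.sum(axis=1)), atol=tol)
    same_nu = np.allclose(np.sort(P.sum(axis=0)), np.sort(Q.sum(axis=0)), atol=tol)
    return same_mu and same_nu and entropy(Q) <= entropy(P) + tol
```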