Abstract
This paper studies the extreme-value problem of Shannon entropy for joint distributions with specified marginals, a subject of growing interest. It establishes a theorem showing that a coupling with minimal entropy must be essentially order-preserving, whereas the coupling with maximal entropy corresponds to independence. In other words, a minimum-entropy coupling of a two-dimensional discrete system can be brought into upper triangular form by exchanging rows and columns of the joint distribution matrix. Entropy is thereby interpreted as a measure of system disorder. The key contribution of this manuscript is to clarify the structural meaning behind optimal-entropy coupling: a special ordinal relationship is pinpointed and methodically described. Furthermore, a computational approach for constructing an order-preserving coupling is offered as a practical illustration.
Keywords: Shannon entropy; minimum-entropy coupling; essentially order-preserving coupling; local optimization
MSC: 60E15; 94A17
1. Introduction and Statement of the Result
The concept of entropy was introduced in thermodynamics and statistical mechanics as a measure of uncertainty or disorganization in a physical system [1,2]. In 1877, L. Boltzmann [2] gave the probabilistic interpretation of entropy and found the famous formula $S = k \log W$. Roughly speaking, entropy is the logarithm of the number of ways in which the physical system can be configured. The second law of thermodynamics says that the entropy of a closed system cannot decrease.
To reveal the physics of information, C. Shannon [3] introduced entropy into communication theory. Let X be a discrete random element with alphabet $\mathcal{X}$ and probability mass function $p(x)$; the entropy of X, written $H(X)$, is defined by
$$H(X) = -\sum_{x\in\mathcal{X}} p(x)\log p(x). \qquad (1)$$
Clearly, $H(X)$, which is called the Shannon entropy, takes its minimum 0 when X is degenerate and takes its maximum $\log|\mathcal{X}|$ when X is uniformly distributed. In this sense, entropy is a measure of the uncertainty of a random element.
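Since entropies are manipulated numerically throughout the paper, a small sketch may help fix conventions; the function below (a standard implementation, not from the paper, using natural logarithms) computes H and confirms the two extremes just mentioned:

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return sum(-x * math.log(x) for x in p if x > 0)

degenerate = [1.0, 0.0, 0.0, 0.0]   # X concentrated on a single point
uniform = [0.25] * 4                # X uniform on four points

print(shannon_entropy(degenerate))  # 0.0 (the minimum)
print(shannon_entropy(uniform))     # log 4 = 1.386... (the maximum for |X| = 4)
```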
In the theory of information, the definition of entropy is extended to a pair of random variables as follows. Let $(X,Y)$ be a two-dimensional random vector with a joint distribution $P = (p_{ij})$; the joint entropy of $(X,Y)$ (or P) is defined by
$$H(X,Y) = -\sum_{i,j} p_{ij}\log p_{ij}. \qquad (2)$$
Another important concept for $(X,Y)$ is mutual information, which is a measure of the amount of information that one random variable contains about the other. It is defined by
$$I(X;Y) = \sum_{i,j} p_{ij}\log\frac{p_{ij}}{\mu_i\nu_j}, \qquad (3)$$
where $\mu = (\mu_i)$ and $\nu = (\nu_j)$ are the marginal distributions of X and Y. By definition, one has
$$I(X;Y) = H(X) + H(Y) - H(X,Y). \qquad (4)$$
Note that in some settings, the maximum of the mutual information is called the channel capacity, which plays a key role in information theory through Shannon’s second theorem: the Channel Coding Theorem [3].
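Identity (4) can be verified numerically on any joint distribution; the sketch below (an illustration with an arbitrary 2×2 joint distribution, not data from the paper) checks it:

```python
import math

def H(ps):
    """Shannon entropy of a list of probabilities (natural log)."""
    return sum(-p * math.log(p) for p in ps if p > 0)

# An arbitrary joint distribution P = (p_ij) and its marginals mu, nu.
P = [[0.2, 0.1],
     [0.3, 0.4]]
mu = [sum(row) for row in P]          # marginal distribution of X
nu = [sum(col) for col in zip(*P)]    # marginal distribution of Y

joint = [p for row in P for p in row]
I = sum(p * math.log(p / (mu[i] * nu[j]))
        for i, row in enumerate(P) for j, p in enumerate(row) if p > 0)

# Identity (4): I(X;Y) = H(X) + H(Y) - H(X,Y)
print(abs(I - (H(mu) + H(nu) - H(joint))) < 1e-12)  # True
```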
For basic concepts and properties in information theory, readers may refer to [4] and the references therein. For entropy and its developments in various topics of pure mathematics, readers may pay attention to the series of talks given by Xiang-Dong Li [5].
By (4), for given marginals $\mu$ and $\nu$, maximizing the mutual information $I(X;Y)$ and minimizing the joint entropy $H(X,Y)$ are two sides of the same coin. Inferring an unknown joint distribution of two random variables with given marginals is an old problem in the area of probabilistic inference. As far as we know, the problem goes back at least to Fréchet [6] and Hoeffding [7,8], who studied the question of identifying the extremal joint distributions that maximize (or minimize, respectively) the correlation. For more studies in this area and more applications in pure and applied sciences, readers may refer to [9,10,11,12,13,14], etc.
In this paper, we consider the following setting of the problem described above. For simplicity, suppose $\mathcal{X} = \{1,2,\dots,n\}$, and let $\mu$ and $\nu$ be two discrete probability distributions on $\mathcal{X}$. Let X and Y be random variables in $\mathcal{X}$ with distributions $\mu$ and $\nu$, respectively; then, we seek a minimum-entropy two-dimensional random vector $(X,Y)$ with the marginals $\mu$ and $\nu$.
One strategy for solving the problem mentioned above is to calculate the exact value of the minimum entropy. Our efforts in this direction will be reported in a subsequent paper [15]: building on the local optimization lemmas given in Section 2, we obtain an algorithm to calculate the exact value of the minimal joint entropy for any given marginals $\mu$ and $\nu$. Note that the computational complexity of our algorithm is .
Another strategy for studying the problem is to seek the unknown special structure of a minimum-entropy coupling. Clearly, in cases where X and Y are independent, the joint entropy takes its maximum $H(X)+H(Y)$. That is to say, the independent structure (perhaps the most disordered one) of $(X,Y)$ determines the maximum entropy; but what special structure of a coupling determines the minimum entropy of a two-dimensional random system? The main goal of the present paper is to establish such a structure.
Denote by $\mathcal{P}_n$ the set of all discrete probability distributions on $\{1,\dots,n\}$. For each $\mu\in\mathcal{P}_n$, let $F_\mu$ be the cumulative distribution function, defined by
$$F_\mu(k) = \sum_{i=1}^{k}\mu_i,\quad k = 1,\dots,n. \qquad (5)$$
Recall that a permutation is a bijective map from $\{1,\dots,n\}$ into itself. For any , define . From Equation (1), one has
which holds for any permutation . For a random variable X with distribution , the random variable has the distribution , where is the inverse of .
For any $\mu,\nu\in\mathcal{P}_n$, denote by $C(\mu,\nu)$ the set of all joint distributions P with the marginals $\mu$ and $\nu$. For any $P\in C(\mu,\nu)$, suppose $(X,Y)$ is distributed according to P. For any permutation pair $(\sigma,\tau)$, denote by $P^{(\sigma,\tau)}$ the joint distribution of $(\sigma(X),\tau(Y))$. Then,
For any , denote by the matrix such that ;; ; and . Otherwise, let . For any n-th order probability matrix P, let (resp. ); then, is the matrix obtained from P by exchanging the positions of its k-th and l-th rows (resp. columns). Furthermore, for any and any finite sequences and , there exists a unique permutation pair , such that
Definition 1.
For any $P\in C(\mu,\nu)$, suppose $(X,Y)$ is distributed according to P. We note the following:
- 1. $(X,Y)$, or P, is order-preserving, if $\mathbb{P}(X\le Y)=1$;
- 2. $(X,Y)$, or P, is essentially order-preserving, if for some permutation pair $(\sigma,\tau)$, $(\sigma(X),\tau(Y))$ is order-preserving.
$(X,Y)$ being order-preserving is also called “X being stochastically dominated by Y” in probability theory. Clearly, it holds if and only if $p_{ij}=0$ for any $i>j$, i.e., the joint distribution matrix is upper triangular. For an essentially order-preserving $(X,Y)$, with the permutation pair $(\sigma,\tau)$ given in Definition 1, $P^{(\sigma,\tau)}$ is upper triangular.
For any $\mu,\nu$, by Strassen’s Theorem [16] on stochastic domination, there exists a coupling $P\in C(\mu,\nu)$ such that P is upper triangular if and only if $F_\mu(k)\ge F_\nu(k)$ for all k, where $F_\mu$ and $F_\nu$ are defined in (5). Note that in the literature on information theory (see [11,17], etc.), people usually say that “ is majorized by ” when this condition holds.
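Assuming the condition in (5) is read as the cumulative dominance $F_\mu(k)\ge F_\nu(k)$ for every k (the standard form of Strassen's condition for an order-preserving coupling), it can be checked as follows:

```python
from itertools import accumulate

def dominance(mu, nu, tol=1e-12):
    """True iff F_mu(k) >= F_nu(k) for all k, i.e., nu stochastically
    dominates mu, so an upper-triangular coupling (X <= Y) exists."""
    return all(a >= b - tol for a, b in zip(accumulate(mu), accumulate(nu)))

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
print(dominance(mu, nu))  # True: nu puts its mass higher, so X <= Y is feasible
print(dominance(nu, mu))  # False
```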
Now, we turn to the following optimization problem:
$$\min_{P\in C(\mu,\nu)} H(P). \qquad (9)$$
Note that $C(\mu,\nu)$ forms a compact subset of $\mathbb{R}^{n\times n}$; the existence of a minimizer $P^*$ follows from the continuity of the entropy function H. Note that in this paper, we also call such a $P^*$ the minimum-entropy coupling.
What key structural characteristics of a coupling (X,Y) govern the minimum entropy of the two-dimensional stochastic system, as formulated in optimization problem (9)? Specifically, what mathematical properties must the joint distribution of (X,Y) satisfy to achieve the global minimum of the joint entropy? Is there an intrinsic connection between such optimal coupling structures and the order-preserving properties of the variables, as suggested by entropy minimization principles? We state our main result as follows.
Theorem 1.
Suppose $\mu,\nu$ are given marginals and $P^*\in C(\mu,\nu)$. If $P^*$ solves the optimization problem (9), and $(X,Y)$ is distributed according to $P^*$, then $(X,Y)$, or $P^*$, is essentially order-preserving.
Theorem 1 shows that a coupling with minimal entropy must be essentially order-preserving, whereas the coupling with maximal entropy corresponds to independence. This means that a minimum-entropy coupling of a two-dimensional discrete system must be an upper triangular discrete joint distribution after exchanging rows and columns of the joint distribution matrix, as made precise by Theorem 3. Consequently, entropy is interpreted as a measure of system disorder.
Another structure of the minimum-entropy coupling is discovered in paper [15]: a tree structure in graph theory. Based on this structure, we develop an algorithm to obtain the exact value of the minimal entropy.
The rest of the paper is arranged as follows. In Section 2.1, we first develop the local optimization lemmas, Lemma 2 and Lemma 3. Then, for any $P\in C(\mu,\nu)$, by using the local optimization lemmas, we construct a $P^*$ such that $H(P^*)\le H(P)$. In Section 2.2, using Lemma 2 developed in Section 2.1, we prove Theorem 1. In Section 3, for P the independent coupling of given marginals, an example for Theorem 3 is presented, and we optimize P to an upper triangular $P^*$.
2. Local Optimization and Proof of Result
2.1. Local Optimization Lemmas
Suppose $A=(a_{ij})$ is a nonnegative matrix (i.e., all its entries are nonnegative) with positive total mass $s=\sum_{i,j}a_{ij}$. We generalize the definition of entropy to nonnegative matrices as
$$H(A) = -\sum_{i,j} a_{ij}\log a_{ij}.$$
Let $B = A/s$, a probability matrix; then
$$H(A) = sH(B) - s\log s. \qquad (11)$$
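Relation (11) — read here, consistently with the definition above, as $H(A)=sH(A/s)-s\log s$ with s the total mass of A — can be confirmed numerically:

```python
import math

def H(entries):
    """Generalized entropy: -sum a log a over all (nonnegative) entries."""
    return sum(-a * math.log(a) for a in entries if a > 0)

A = [0.3, 0.1, 0.2, 0.1]        # entries of a nonnegative matrix, total mass s
s = sum(A)                      # s = 0.7
B = [a / s for a in A]          # the normalized probability matrix A/s

lhs = H(A)
rhs = s * H(B) - s * math.log(s)
print(abs(lhs - rhs) < 1e-12)   # True
```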
For any $x\in[0,1]$, define $h(x) = -x\log x$ for $x>0$ and $h(0)=0$. Before the local optimization lemmas, we give the following simple property of the function h without proof.
Lemma 1.
For any closed interval ,
Lemma 2.
For any second-order nonnegative matrix $A$, suppose that and denote . Let $A^*$ be such that , , and . Then, $H(A^*)\le H(A)$. Furthermore, if $b>0$, then $H(A^*)<H(A)$.
Proof.
Without loss of generality, assume that . Note that in this case, one has and . For any , define
Let and ; then, .
By Lemma 1, and the assumption given above, and Thus, and ; then, by definition .
If , then . By Lemma 1, ; this implies that and . □
Lemma 3.
For any second-order nonnegative matrix , suppose that , , and . Let and define as in Lemma 2; then, .
Proof.
By Relation (11), without loss of generality, assume A to be a probability matrix. Rewrite the probability matrix by , and write and as its marginals, i.e., and , and and . By the conditions of the lemma, one has , and .
For any , let
Clearly, is a joint probability distribution matrix with marginals , and and . Particularly, when x runs over the interval , runs over .
Now, . It follows immediately from Lemma 1 that and take their maximums simultaneously when ; then, and . □
From the statements of Lemmas 2 and 3, for any second-order nonnegative matrix A, one may easily construct a nonnegative matrix $A^*$, which possesses the same row and column sums as A, such that $H(A^*)\le H(A)$.
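As an illustration of this construction (a sketch under our reading of the lemmas: the entropy is minimized at an endpoint of the one-parameter family of 2×2 matrices sharing the row and column sums of A), one can simply compare the two endpoints:

```python
import math

def H(M):
    """Entropy of a nonnegative matrix given as nested lists."""
    return sum(-x * math.log(x) for row in M for x in row if x > 0)

def local_opt_2x2(M):
    """Return the lower-entropy endpoint of the segment of 2x2 nonnegative
    matrices having the same row and column sums as M."""
    (a, b), (c, d) = M
    r1, c1, r2 = a + b, a + c, c + d
    lo = max(0.0, c1 - r2)          # smallest feasible (1,1)-entry
    hi = min(r1, c1)                # largest feasible (1,1)-entry
    endpoints = [[[x, r1 - x], [c1 - x, r2 - (c1 - x)]] for x in (lo, hi)]
    return min(endpoints, key=H)

A = [[0.42, 0.18],
     [0.28, 0.12]]                  # independent coupling of (0.6, 0.4) and (0.7, 0.3)
A_star = local_opt_2x2(A)
print(A_star)                       # ~ [[0.6, 0.0], [0.1, 0.3]]
print(H(A_star) < H(A))             # True: entropy strictly decreased
```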
As a consequence of Lemma 3, the optimization problem (9) for $n=2$ is solved as follows.
Corollary 1.
For such that , , and , let
then
Now, as a direct consequence of Lemmas 2 and 3, we obtain the following local optimization theorem.
Theorem 2.
Suppose , , and . For any and , let
be the second-order submatrix of P. Let $A^*$ be given as in Lemma 2 or Lemma 3, and let $P^*$ be the matrix obtained from P by putting $A^*$ in the place of A. Then, $P^*\in C(\mu,\nu)$ and $H(P^*)\le H(P)$.
2.2. Proof of Theorem 1
In this subsection, we prove Theorem 1. The strategy of the proof is to apply the local optimization Lemma 2 repeatedly. First of all, we have the following lemma.
Lemma 4.
Let be an n-th-order nonnegative matrix. Suppose that
Then, by using the local optimization procedure developed in Lemma 2 and Theorem 2 at most times, we finally transform A to $A^*$ such that ,
Furthermore, if and only if
Proof.
Let and , and denote by and the cardinalities of I and J. Without loss of generality, assume that satisfies
Write , and renew A by changing the second-order submatrix
Then, the renewed A still satisfies the conditions of Lemma 2, and then, by this lemma, its entropy is decreased strictly. Note that after using the local optimization procedure once, for the renewed matrix A, the number decreases by 1.
Repeat the above procedure until and write as the final renewing of A; thus, is obtained as required. □
In the situation given as in the statement of Lemma 4, we denote as the corresponding composite optimization procedure, and write .
Now, we finish the proof of Theorem 1 by proving the following theorem.
Theorem 3.
For any marginals $\mu,\nu$ and any $P\in C(\mu,\nu)$, there exists a permutation pair $(\sigma,\tau)$ and an upper triangular $P^*$ such that $H(P^*)\le H(P)$. Furthermore, in cases where $P^{(\sigma,\tau)}$ is not upper triangular for any permutation pair $(\sigma,\tau)$, $H(P^*)<H(P)$.
Proof.
For $n=2$, Theorem 3 follows from Lemma 2 immediately. For general $\mu$ and $\nu$, suppose . In the case of
let (resp. ), where , , , and . By (8), for some given permutation pair , and A satisfies Condition (13). By Lemma 4, Equations (14) and (15) hold for .
Note that by (14), A and are all probability matrices in , and .
Now, has the form
Consider the following -th-order submatrix: of with (resp. ) for ,
suppose to be the entry of such that (resp. ).
For simplicity, we only treat the first case of (the resp. case in the brackets can be treated similarly); of course, in this case, one has
In the subcase when
let (resp. ). Clearly, by (8), for some given permutation pair , , and Here, we note the following:
- has the same form of as given in Equation (16), i.e., all entries except for in the first column of are zero. In fact, by transforming P to , we finished the first step of upper-triangulization;
- , as the -th-order submatrix of satisfies the condition of Lemma 4 for .
At this moment, we have finished the first step of the optimization of P by transforming P to . In fact, we are standing at the position to begin the second step by using Lemma 4 on for .
Repeat the above procedure for , , ⋯, and finally for 2; is then transformed to , , and finally to . Additionally, there further exist certain permutation pair sequences such that
and
Now, let , . Then, , and by our setting, is upper triangular.
If $P^{(\sigma,\tau)}$ is not upper triangular for any permutation pair $(\sigma,\tau)$, then at least one optimization step carried out above is nontrivial, i.e., when Lemma 2 is used, the corresponding b is strictly positive. Then, by Lemma 2 and Theorem 2, $H(P^*)<H(P)$. □
3. An Example
Theorem 1 shows that a coupling with minimal entropy must be essentially order-preserving. How can one construct such an order-preserving coupling? In this section, we offer a computational approach for an order-preserving coupling as a practical illustration. Without loss of generality, let us consider the following example: let , , , and let P be the independent coupling of :
Now, we begin to optimize P to an upper triangular $P^*$, as follows.
First, since the largest entry of P is and , by exchanging the positions of the 4th and 6th rows and then of the 3rd and 6th columns of P, we obtain
By using Lemma 2 on the submatrices
we renew and optimize A to
then, using Lemma 2 on submatrix
we optimize A to the following
In this situation, we denote
since is the largest entry and
we take , , and satisfies the conditions of Lemma 4 for .
Second, by Lemma 4, is optimized to the following
Noticing that the largest entry of matrix
is , and , we optimize to
Finally, by similar procedures, we optimize to , to , and then to as follows:
Clearly, is upper triangular; with and
By Lemma 2, can be further optimized to an upper triangular matrix
Using Lemma 2 on the following submatrix of
and then exchanging the positions of its 2nd and 4th rows, we obtain
Note that is upper triangular and with and .
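The optimization carried out above can be mimicked mechanically. The sketch below is not the exact row/column-exchange procedure of this section; it is a simplified sweep (our own illustration, run on the independent coupling of two uniform marginals rather than the example's data) that applies the 2×2 endpoint move of Lemma 2/Theorem 2 to every pair of rows and columns until no move lowers the entropy, preserving the marginals throughout:

```python
import math

def H(M):
    return sum(-x * math.log(x) for row in M for x in row if x > 0)

def h4(entries):
    return sum(-t * math.log(t) for t in entries if t > 0)

def sweep_2x2(M, tol=1e-12):
    """Apply the 2x2 endpoint move of Lemma 2 / Theorem 2 to every pair of
    rows (i, k) and columns (j, l) until no move lowers the entropy.
    Row and column sums (the marginals) are preserved throughout."""
    n = len(M)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for k in range(i + 1, n):
                for j in range(n):
                    for l in range(j + 1, n):
                        a, b, c, d = M[i][j], M[i][l], M[k][j], M[k][l]
                        lo = max(0.0, (a + c) - (c + d))   # feasibility bounds
                        hi = min(a + b, a + c)             # for the (i,j)-entry
                        cands = [(x, a + b - x, a + c - x,
                                  (c + d) - (a + c) + x) for x in (lo, hi)]
                        best = min(cands, key=h4)
                        if h4(best) < h4((a, b, c, d)) - tol:
                            M[i][j], M[i][l], M[k][j], M[k][l] = best
                            improved = True
    return M

P = [[0.25 * 0.25] * 4 for _ in range(4)]    # independent coupling of two uniforms
Q = sweep_2x2([row[:] for row in P])
print(H(Q) < H(P))                            # True: entropy strictly decreased
print([round(sum(row), 6) for row in Q])      # marginals preserved: all 0.25
```

For this uniform example, the sweep drives the independent coupling toward a sparse matrix; for marginals satisfying the dominance condition of Section 1, the limit can then be permuted toward upper triangular form, in the spirit of Theorem 3.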
4. Conclusions
This study resolves the extremal problem of Shannon entropy for joint distributions with given marginals by establishing a localized optimization method to derive minimum-entropy structures. We rigorously demonstrate that minimum-entropy couplings necessitate an order-preserving structure (transformable into upper triangular matrices via permutations), while maximum-entropy coupling aligns with independent distributions, thereby confirming entropy’s role in quantifying system disorder and addressing a gap in system coupling optimization. The proposed order-preserving coupling framework, combined with matrix structure analysis, not only suggests implications for communications and bioinformatics but also provides a novel approach to simplifying exact-solution computations for the minimum-entropy coupling. These findings highlight the critical connection between ordinal relationships and entropy optimization, offering both methodological tools and fresh perspectives for understanding ordered structures in complex systems.
Author Contributions
Theoretical derivation, F.W. and X.-Y.W.; manuscript writing and submission, Y.-J.M.; funding support and discussion, K.-Y.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Science Foundation of China (grant numbers 11471222 and 61973015) and supported by the Beijing Outstanding Young Scientist Program (No. JWZQ20240101027).
Data Availability Statement
No new data were created in the research.
Acknowledgments
The authors would like to thank Zhao Dong, from the Institute of Mathematics and Systems Science, Chinese Academy of Sciences, for useful discussion and comments.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Boltzmann, L. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen; Aus der k.k. Hof- und Staatsdruckerei: Vienna, Austria, 1872.
- Boltzmann, L. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung resp. den Sätzen über das Wärmegleichgewicht. Kais. Akad. Wiss. Wien Math. Naturwiss. Classe 1877, 76, 373–435.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2006.
- Li, X.D. Series of Talks on Entropy and Geometry; The Institute of Applied Mathematics, Chinese Academy of Sciences: Beijing, China, 2021.
- Fréchet, M. Sur les tableaux de corrélation dont les marges sont données. Ann. Univ. Lyon Sci. Sect. A 1951, 14, 53–77.
- Hoeffding, W. Masstabinvariante Korrelationstheorie. Schriften Math. Inst. Univ. Berl. 1940, 5, 181–233.
- Fisher, N.I.; Sen, P.K. (Eds.) Scale-invariant correlation theory. In The Collected Works of Wassily Hoeffding; English translation; Springer: Berlin/Heidelberg, Germany, 1999; pp. 57–107.
- Kovačević, M.; Stanojević, I.; Šenk, V. On the hardness of entropy minimization and related problems. In Proceedings of the 2012 IEEE Information Theory Workshop, Lausanne, Switzerland, 3–7 September 2012; pp. 512–516.
- Lin, G.D.; Dou, X.; Kuriki, S.; Huang, J.-S. Recent developments on the construction of bivariate distributions with fixed marginals. J. Stat. Distrib. Appl. 2014, 1, 14.
- Cicalese, F.; Gargano, L.; Vaccaro, U. Minimum-Entropy Couplings and Their Applications. IEEE Trans. Inf. Theory 2019, 65, 3436–3451.
- Benes, V.; Stepan, J. (Eds.) Distributions with Given Marginals and Moment Problems; Springer: Berlin/Heidelberg, Germany, 1997.
- Cuadras, C.M.; Fortiana, J.; Rodriguez-Lallena, J.A. (Eds.) Distributions with Given Marginals and Statistical Modeling; Springer: Berlin/Heidelberg, Germany, 2002.
- Dall’Aglio, G.; Kotz, S.; Salinetti, G. (Eds.) Advances in Probability Distributions with Given Marginals; Springer: Berlin/Heidelberg, Germany, 1991.
- Ma, Y.J.; Wang, F.; Wu, X.Y.; Cai, K.Y. Calculating the Minimal Joint Entropy. 2024; in preparation.
- Strassen, V. The existence of probability measures with given marginals. Ann. Math. Stat. 1965, 36, 423–439.
- Li, C.T. Efficient Approximate Minimum Entropy Coupling of Multiple Probability Distributions. IEEE Trans. Inf. Theory 2021, 67, 5259–5268.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).