Proceeding Paper

Kangaroos in Cambridge †

Romke Bontekoe 1 and Barrie J. Stokes 2
1  Bontekoe Research, 1052 WJ Amsterdam, The Netherlands
2  New Lambton Heights, Newcastle, NSW 2305, Australia
*  Author to whom correspondence should be addressed.
Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.
Phys. Sci. Forum 2022, 5(1), 22; https://doi.org/10.3390/psf2022005022
Published: 8 December 2022

Abstract

In this tutorial paper the Gull–Skilling kangaroo problem is revisited. The problem is used as an example of solving an under-determined system by variational principles, the maximum entropy principle (MEP), and Information Geometry. The relationship between correlation and information is demonstrated. The Kullback–Leibler divergence of two discrete probability distributions is shown to fail as a distance measure. However, an analogy with rigid body rotations in classical mechanics is motivated. A table of proper “geodesic” distances between probability distributions is presented. With this paper the authors pay tribute to their late friend David Blower.

1. Introduction

On my (RB) first meeting with Dr. John Skilling and Dr. Steve Gull in Cambridge in 1987, I was posed the following problem [1,2,3]:
In Australia, 3/4 of the kangaroos are right-handed and 1/3 have blue eyes. Can you construct the 2 × 2 probability table?
Having no clue about the use of their shorter forelegs, let alone any handedness, nor of the colour of their eyes, I assumed that:
  • a kangaroo is right-handed or left-handed; and
  • a kangaroo has blue eyes or green eyes.
This means that there are four distinct possibilities: right-handed with blue eyes, right-handed with green eyes, left-handed with blue eyes, and left-handed with green eyes. The statement space is of dimension 2 × 2 and has 4 cells, and a bare probability table looks like Table 1, showing the two given marginal values and the sum.
The two other marginal values result from normalizing the sum of the joint probabilities. Filling in the table a little more, we obtain Table 2. The notation Q i for probabilities originates with David Blower, who avoids the overused P-symbol. In this paper we follow Blower’s notation closely [4].
I thought about this problem for a short while and filled in the table by multiplying the row and column marginal values, as in Table 3.
However, then I was presented with the following set of equations
$$Q_1 + Q_2 = 3/4, \qquad Q_1 + Q_3 = 1/3, \qquad Q_1 + Q_2 + Q_3 + Q_4 = 1. \tag{1}$$
There are only three equations in four unknowns, leaving any other (consistent) equations relating to the Q i redundant. This is an under-determined system. In my proposed solution, I must have used a fourth equation. So, where did this fourth equation come from? My answer was that I assumed that handedness and eye colour are independent, and thus the marginal probabilities could be multiplied. “Aah”, they said, “you have applied the Maximum Entropy Principle!”
Jaynes discussed and extended the kangaroo problem in the Fourth Maximum Entropy Workshop in 1984 [3].
This under-determined system has one free variable. Choosing Q 1 as the free variable, the equations reduce to
$$Q_2 = 3/4 - Q_1, \qquad Q_3 = 1/3 - Q_1, \qquad Q_4 = Q_1 - 1/12. \tag{2}$$
A symbolic solution can be obtained by using Wolfram Mathematica’s Reduce[] function [5] as shown in Figure 1.
In this code snippet, the three equations can be recognized as well as the positivity condition. The solution is
$$1/12 \le Q_1 \le 1/3. \tag{3}$$
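Figure 1 is reproduced only as an image; a minimal sketch of such a Reduce[] call (with our own variable names, which need not match those in the figure) is:

    Reduce[{q1 + q2 == 3/4, q1 + q3 == 1/3, q1 + q2 + q3 + q4 == 1,
            q1 >= 0, q2 >= 0, q3 >= 0, q4 >= 0},
           {q2, q3, q4}, Reals]
    (* returns a result equivalent to:
       1/12 <= q1 <= 1/3 && q2 == 3/4 - q1 && q3 == 1/3 - q1 && q4 == q1 - 1/12 *)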
With this solution the probability table can be filled in as in Table 4.
Figure 2 shows a range of solutions to this problem. This figure illustrates the correlation and anti-correlation between the various $Q_i$. Since $Q_1$ and $Q_2$ have to maintain their sum as $3/4$, they must be anti-correlated. Therefore the coloured lines cross each other between $Q_1$ and $Q_2$. Similarly, $Q_1$ is anti-correlated with $Q_3$. Therefore, $Q_2$ and $Q_3$ have to be correlated, and the coloured lines between them do not cross. Finally, $Q_1$ is correlated with $Q_4$, which can be seen from the repeated $Q_1$-axis at the right.

2. Variational Principles

A possible solution for an under-determined problem can be found by adopting a variational principle. This is a function of the joint probabilities to be optimized (maximized or minimized) under some constraints, whose free parameters correspond to the missing equations. Sivia considers four variational functions: the entropy, the sum of squares, the sum of logarithms, and the sum of square roots, as shown in Table 5 [1].
In the case of the Least Squares variational function we have
$$f(Q) = Q_1^2 + Q_2^2 + Q_3^2 + Q_4^2 = Q_1^2 + (3/4 - Q_1)^2 + (1/3 - Q_1)^2 + (Q_1 - 1/12)^2 = 4 Q_1^2 - \tfrac{7}{3} Q_1 + \tfrac{49}{72}. \tag{4}$$
This is a quadratic function and has a unique minimum at
$$Q_1 = 7/24, \tag{5}$$
which yields the exact solution of $M_{\mathrm{VP,LeastSq}}$
$$Q_i = (7/24,\, 11/24,\, 1/24,\, 5/24). \tag{6}$$
For the Maximum Entropy, the variational function
$$f(Q) = -Q_1 \log Q_1 - Q_2 \log Q_2 - Q_3 \log Q_3 - Q_4 \log Q_4 \tag{7}$$
has to be maximized, subject to the constraints. This function has a unique maximum at
$$Q_1 = 1/4, \tag{8}$$
which yields the exact solution of $M_{\mathrm{VP,MaxEnt}}$
$$Q_i = (1/4,\, 1/2,\, 1/12,\, 1/6). \tag{9}$$
The solutions for $Q_1$ for the Maximum logarithms and Maximum square roots variational functions can only be obtained via numerical optimization. For each solution $Q_1$, the other three $Q_i$ values follow directly from (2). The Variational Principle solutions are tabulated in Table 6 and visualized in Figure 3.
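A minimal Wolfram Mathematica sketch of this numerical optimization, not part of the original figures, might look as follows; the constraints are those of (1) and the objective functions are those of Table 5:

    cons = {q1 + q2 == 3/4, q1 + q3 == 1/3, q1 + q2 + q3 + q4 == 1,
            q1 > 0, q2 > 0, q3 > 0, q4 > 0};
    vars = {q1, q2, q3, q4};
    NMaximize[{-Total[vars Log[vars]], cons}, vars]   (* maximum entropy      *)
    NMinimize[{Total[vars^2], cons}, vars]            (* least squares        *)
    NMaximize[{Total[Log[vars]], cons}, vars]         (* maximum logarithms   *)
    NMaximize[{Total[Sqrt[vars]], cons}, vars]        (* maximum square roots *)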
However, given these four different solutions to the kangaroo problem, we need a rationale for choosing one of them. Which one is ’best’? Sivia states that barring some evidence about a gene-linkage between handedness and eye colour for kangaroos, the MaxEnt model is preferred because this model provides the only uncorrelated assignment of the Q i . This is shown in Section 4.

3. State Space and Constraint Functions

In the kangaroo problem, we have two traits: handedness and eye colour. Each trait has a set of features; for the handedness they are “right-handed” and “left-handed”; for the eye colour “blue” and “green”. Mixtures of features are not allowed. Therefore, for every trait, one, and only one, feature applies; the features are mutually exclusive.
More abstractly, the features can be represented as statements. The combined features from different traits form joint statements. The joint statements define a state space of dimension n = 4 . The n cells uniquely number the joint statements. Table 7 shows the general setup.
Any joint statement about a kangaroo can be placed in one and only one cell of the state space. For example, a left-handed and blue-eyed kangaroo is uniquely defined by the joint statement X = x 3 . In this notation, the X denotes the two traits, and the x 3 specifies the features in cell 3. The state space is congruent to the probability table of Table 1, but it has a different role. The joint statements, X = x i , are logical statements which can be either True or False.
A constraint function is defined over the state space, as shown in Table 8. The function F assigns a Boolean value to each joint statement and returns a vector of values ([4], Ch. 21)
$$\Big(F(X=x_1),\; F(X=x_2),\; F(X=x_3),\; F(X=x_4)\Big). \tag{10}$$
The constraint function vector specifies the operation of a constraint.
The constraint function F 1 for our first constraint, “In Australia 3 / 4 of the kangaroos are right-handed ...,” is shown in Table 9. Writing out the constraint function vector for F 1 , we have
$$\Big(F_1(X=x_1),\; F_1(X=x_2),\; F_1(X=x_3),\; F_1(X=x_4)\Big) = (1, 1, 0, 0). \tag{11}$$
The corresponding constraint function vector for the left-handed kangaroos is its complement, ( 0 , 0 , 1 , 1 ) .
The constraint function F 2 for the second constraint, “... and 1 / 3 have blue eyes,” is shown in Table 10. Writing out the constraint function vector F 2 , we obtain
$$\Big(F_2(X=x_1),\; F_2(X=x_2),\; F_2(X=x_3),\; F_2(X=x_4)\Big) = (1, 0, 1, 0). \tag{12}$$
The constraint function vector for the blue-eyed kangaroos is ( 1 , 0 , 1 , 0 ) , and ( 0 , 1 , 0 , 1 ) for the green-eyed ’roos.
The probability distribution is normalized, which means that the sum of all joint probabilities is unity. This is also a constraint. The overall normalization is a universal constraint function vector
$$\Big(F_0(X=x_1),\; F_0(X=x_2),\; F_0(X=x_3),\; F_0(X=x_4)\Big) = (1, 1, 1, 1). \tag{13}$$
This whole business of creating constraint function vectors for assigning probabilities may seem overly elaborate but conceptually, and operationally, we need a way to connect a statement X = x i with a numerical value. Technically, F is an operator that accepts a joint statement as its variable and returns a Boolean value. Furthermore, the constraint function vectors F j become the basis vectors  e j in the vector space of the information geometry in Section 6.

4. Correlation, Covariance, and Entropy

What do correlation and covariance actually mean, and what is the difference? Sometimes the two terms are used interchangeably.
We all have an intuitive interpretation. For instance, people’s heights and weights are correlated, which means that generally, tall persons weigh more than short ones. The two variables vary together; they are co-varying. However, this does not necessarily reflect a causal relationship. Gaining weight does not automatically imply becoming taller, as we all know.

4.1. Expectation

Suppose that a function V X = x i is defined over the state space and returns a numerical value for each joint statement. The expectation of V is
$$\langle V \rangle = \sum_{i=1}^{n} V(X=x_i)\, Q_i. \tag{14}$$
The sum is over all $V(X=x_i)$ values in the state space, whereas the $Q_i$ are from the probability table. The expectation value, $\langle V \rangle$, is a numerical quantity.
With this definition, let’s compute the expectation for “right-handedness”. The constraint function vector for right-handedness, F 1 = ( 1 , 1 , 0 , 0 ) , acts as the quantity V
$$\langle F_1 \rangle = \sum_{i=1}^{n} F_1(X=x_i)\, Q_i = F_1(X=x_1) Q_1 + F_1(X=x_2) Q_2 + F_1(X=x_3) Q_3 + F_1(X=x_4) Q_4 = 1\, Q_1 + 1\, Q_2 + 0\, Q_3 + 0\, Q_4 = Q_1 + Q_2 = 3/4. \tag{15}$$
In the last step, we have used the information given in Table 2. The expectation for right-handedness thus equals its marginal value.
Similarly for “blue eyes”, with F 2 = ( 1 , 0 , 1 , 0 )
$$\langle F_2 \rangle = \sum_{i=1}^{n} F_2(X=x_i)\, Q_i = 1\, Q_1 + 0\, Q_2 + 1\, Q_3 + 0\, Q_4 = Q_1 + Q_3 = 1/3. \tag{16}$$
Furthermore, the expectation value for blue eyes again equals its marginal value.

4.2. Variance

The variance of the V X = x i values is defined as
$$\operatorname{var}(V) = \sum_{i=1}^{n} \big(V(X=x_i) - \langle V \rangle\big)^2 Q_i = \Big\langle \big(V(X=x_i) - \langle V \rangle\big)^2 \Big\rangle. \tag{17}$$
Notice that there are two nested sets of brackets $\langle \cdot \rangle$ involved. The $\langle V \rangle$ is defined by (14).
By expanding the square, this can be rewritten as
$$\operatorname{var}(V) = \Big\langle \big(V(X=x_i) - \langle V \rangle\big)^2 \Big\rangle = \Big\langle V(X=x_i)^2 - 2 V(X=x_i)\langle V \rangle + \langle V \rangle^2 \Big\rangle = \big\langle V(X=x_i)^2 \big\rangle - 2 \big\langle V(X=x_i) \big\rangle \langle V \rangle + \langle V \rangle^2 = \sum_{i=1}^{n} V(X=x_i)^2\, Q_i - \langle V \rangle^2. \tag{18}$$
We have used the properties $\langle \langle V \rangle \rangle = \langle V \rangle$ and $\langle \langle V \rangle^2 \rangle = \langle V \rangle^2$ in the above derivation, because $\langle V \rangle$ is a constant.
So what is the variance of “right-handedness”? Taking V = F 1 , we obtain
$$\operatorname{var}(F_1) = \sum_{i=1}^{n} F_1(X=x_i)^2\, Q_i - \langle F_1 \rangle^2 = 1^2 Q_1 + 1^2 Q_2 + 0^2 Q_3 + 0^2 Q_4 - \langle F_1 \rangle^2 = Q_1 + Q_2 - \langle F_1 \rangle^2 = 3/4 - (3/4)^2 = 3/16. \tag{19}$$
The variance of “blue eyes” is
$$\operatorname{var}(F_2) = \sum_{i=1}^{n} F_2(X=x_i)^2\, Q_i - \langle F_2 \rangle^2 = 1^2 Q_1 + 0^2 Q_2 + 1^2 Q_3 + 0^2 Q_4 - \langle F_2 \rangle^2 = Q_1 + Q_3 - \langle F_2 \rangle^2 = 1/3 - (1/3)^2 = 2/9. \tag{20}$$
We conclude that both variances are independent of Q 1 .

4.3. Covariance

The covariance between two variables $V(X=x_i)$ and $W(X=x_i)$ is defined by
$$\operatorname{cov}(V, W) = \Big\langle \big(V(X=x_i) - \langle V \rangle\big)\big(W(X=x_i) - \langle W \rangle\big) \Big\rangle. \tag{21}$$
By a similar expansion as above, the product can be written as
$$\operatorname{cov}(V, W) = \Big\langle V(X=x_i) W(X=x_i) - V(X=x_i)\langle W \rangle - W(X=x_i)\langle V \rangle + \langle V \rangle \langle W \rangle \Big\rangle = \big\langle V(X=x_i) W(X=x_i) \big\rangle - \langle V \rangle \langle W \rangle = \sum_{i=1}^{n} V(X=x_i)\, W(X=x_i)\, Q_i - \langle V \rangle \langle W \rangle. \tag{22}$$
What does this give for the cov F 1 , F 2 ? Expanding the sum and substituting the constraint function vectors F 1 = ( 1 , 1 , 0 , 0 ) and F 2 = ( 1 , 0 , 1 , 0 ) , we obtain
$$\operatorname{cov}(F_1, F_2) = 1 \cdot 1\, Q_1 + 1 \cdot 0\, Q_2 + 0 \cdot 1\, Q_3 + 0 \cdot 0\, Q_4 - \langle F_1 \rangle \langle F_2 \rangle = Q_1 - \tfrac{3}{4} \cdot \tfrac{1}{3} = Q_1 - \tfrac{1}{4}. \tag{23}$$
We find that cov F 1 , F 2  does depend on Q 1 .
The variances and covariances can be combined in the variance-covariance matrix, which is defined by
$$\Sigma(F_1, F_2) = \begin{pmatrix} \operatorname{var}(F_1) & \operatorname{cov}(F_1, F_2) \\ \operatorname{cov}(F_1, F_2) & \operatorname{var}(F_2) \end{pmatrix} = \begin{pmatrix} 3/16 & Q_1 - 1/4 \\ Q_1 - 1/4 & 2/9 \end{pmatrix}. \tag{24}$$
The variance-covariance matrix is related to the metric tensor g from information geometry in Section 6.

4.4. Correlation

The correlation coefficient is a single value derived from the variance and covariance values. It is defined as
$$\rho(V, W) = \frac{\operatorname{cov}(V, W)}{\sqrt{\operatorname{var}(V)\,\operatorname{var}(W)}}. \tag{25}$$
Therefore the correlation between the eye colour and the handedness of the kangaroos is
$$\rho(F_1, F_2) = \frac{\operatorname{cov}(F_1, F_2)}{\sqrt{\operatorname{var}(F_1)\,\operatorname{var}(F_2)}} = \frac{Q_1 - 1/4}{\sqrt{(3/16)(2/9)}} = 2\sqrt{6}\,\Big(Q_1 - \tfrac{1}{4}\Big). \tag{26}$$
This finally confirms that indeed, the MaxEnt solution, with Q 1 = 1 / 4 , has zero correlation. We agree with Sivia that the other variational functions yield a positive or negative correlation between handedness and eye colour. (Notice that our correlation coefficients have the opposite sign, because Sivia correlates the left-handedness with blue eyes [1].) Table 11 shows the model solutions Q i and the corresponding correlation values.
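The result (26) is easily checked symbolically; a short sketch in our own notation:

    jointQ[q1_] := {q1, 3/4 - q1, 1/3 - q1, q1 - 1/12};     (* the Q_i of (2)          *)
    f1 = {1, 1, 0, 0}; f2 = {1, 0, 1, 0};                   (* constraint vectors      *)
    cov[a_, b_, q_] := (a b) . q - (a . q) (b . q);
    rho[q1_] := cov[f1, f2, jointQ[q1]] /
                Sqrt[cov[f1, f1, jointQ[q1]] cov[f2, f2, jointQ[q1]]];
    Simplify[rho[q1]]     (* equivalent to 2 Sqrt[6] (q1 - 1/4)           *)
    rho[1/4]              (* 0: the MaxEnt assignment is uncorrelated     *)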
One may have gotten the impression that the constraint function values are always 0 or 1, but these are specific for the problem treated in this paper. In general, a constraint function may yield any numerical value. The construction of a constraint function can be intricate; see, for example, Blower ([4], p. 63).

4.5. Entropy

The information entropy is a measure of the amount of missing information in a probability distribution. The information entropy H ( Q ) of a discrete probability distribution is
$$H(Q) = -\sum_{i=1}^{n} Q_i \log Q_i. \tag{27}$$
Of all possible probability distributions, the discrete uniform distribution has the maximum missing information. Thus for $n = 4$, we have $q = (1/4, 1/4, 1/4, 1/4)$ with
$$H(q) = -\sum_{i=1}^{n} \tfrac{1}{4} \log \tfrac{1}{4} = \log 4 \approx 1.39. \tag{28}$$
When the natural logarithm $\log_e$ is used, the units of entropy are nats. However, the entropy can also be defined in terms of the more familiar bits when $\log_2$ is used. The conversion of $H(Q)$ to bits by multiplying by $\log_2 e \approx 1.44$ gives
$$H(q) = \log 4 \cdot \log_2 e = 1.39 \times 1.44 = 2 \text{ bits}. \tag{29}$$
Maximum missing information of two bits exactly describes our minimum state of knowledge in a 2 × 2 state space with four equally probable states. We need one bit to choose a column and another bit to choose a row. Combined, we have fully specified one of four equally probable states or cells in the state space.
Absolute certainty is described by zero bits of missing information. This is attained when one $Q_i = 1$ and all other $Q_{j \neq i} = 0$. Then our state of knowledge is fully specified and there is no missing information. For example, a “certain distribution” is $p = (0, 0, 1, 0)$, for which the entropy is
$$H(p) = 0. \tag{30}$$
Here we have used
$$\lim_{x \to 0^{+}} x \log x = 0, \tag{31}$$
and $\log 1 = 0$.
Table 11 shows the values of $H(Q)$, in bits, in the last column. Although all models have an entropy that is smaller than two bits, the numerical values of the entropy are not easily assessed intuitively. Jaynes gives an excellent explanation to guide one’s intuition ([6], Ch. 11.3).
Suppose we were first told about the kangaroos’ handedness, namely p 1 = 3 / 4 versus p 2 = 1 / 4 . The information entropy of this binary case is
$$H_2(p_1, p_2) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.81. \tag{32}$$
Next, we learn that the first alternative consists of two possibilities, namely blue and green eyes, with p 1 = q 1 + q 2 , where q 1 = 1 / 4 and q 2 = 1 / 2 . The information entropy for the ternary case becomes
$$H_3(q_1, q_2, p_2) = H_2(p_1, p_2) + p_1 H_2\!\Big(\frac{q_1}{p_1}, \frac{q_2}{p_1}\Big) = H_2(p_1, p_2) + \tfrac{3}{4}\Big(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\Big) = 0.81 + 0.69 = 1.50. \tag{33}$$
Finally, the second alternative also consists of two possibilities, namely p 2 = q 3 + q 4 , with q 3 = 1 / 12 and q 4 = 1 / 6 . The information entropy becomes
$$H_4(q_1, q_2, q_3, q_4) = H_3(q_1, q_2, p_2) + p_2 H_2\!\Big(\frac{q_3}{p_2}, \frac{q_4}{p_2}\Big) = H_3(q_1, q_2, p_2) + \tfrac{1}{4}\Big(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\Big) = 1.50 + 0.23 = 1.73. \tag{34}$$
We recognize the same value as for M VP , MaxEnt in Table 11. In this example, the state space is gradually expanded and, as the number of cells increases one’s ambivalence also increases, which is reflected in an increase in the entropy. The example also shows that the subsequent H n are additive. Notice that the above partitioning of the p 1 and p 2 is proportional to the blue- and green-eyed kangaroos ratio.
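This grouping argument is easy to reproduce; a short sketch (our notation, valid for distributions without zero entries) is:

    Hbits[q_] := -q . Log2[q];              (* entropy in bits; zero entries would
                                               need the limit (31)                  *)
    h2 = Hbits[{3/4, 1/4}];                 (* handedness only                      *)
    h3 = h2 + 3/4 Hbits[{1/3, 2/3}];        (* split the 3/4 into blue/green eyes   *)
    h4 = h3 + 1/4 Hbits[{1/3, 2/3}];        (* split the 1/4 into blue/green eyes   *)
    N[{h2, h3, h4}]                         (* -> {0.81, 1.50, 1.73}                *)
    N[Hbits[{1/4, 1/2, 1/12, 1/6}]]         (* the same 1.73 bits, computed directly *)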
For a given set of constraints, of all possible models, the maximum entropy solution has the highest information entropy ([4], Ch. 24.2), which is confirmed in Table 11. This means that the M VP , MaxEnt solution has the most missing information. Consequently, in one way or another, some extra information was introduced by the other variational functions. From the example above, one may surmise that the additional information originates from a different partitioning of the p 1 and p 2 into the q i -s.
This extra information also shows up as non-zero correlations; the higher the absolute value of the correlation, the lower the information entropy. Therefore, correlation induces information, reducing the amount of missing information.

5. Maximum Entropy Principle

Although we have already obtained several solutions to the kangaroo problem by the optimization of various variational functions, the procedure may be seen as ad hoc. The Maximum Entropy Principle (MEP) is a versatile problem-solving method based on the work of Shannon and Jaynes ([6], Ch. 11; [3,7]). The MEP is a method with highly desirable features for making numerical assignments, and, most importantly, all conceivable legitimate numerical assignments may be made, and are made, via the MEP. The book by Blower [4] is entirely devoted to the MEP.

5.1. Interactions

Blower defines the interaction between two (or more) constraints as the product of their constraint function vectors. Here we have two constraints, which can have only one interaction, namely between “right-handed” and “blue eyes”. In problems with more dimensions, higher-dimensional interactions can be defined by the product of three or more constraint function vectors.
The interaction vector is the element-wise product of the relevant constraint function vectors
$$F_3(X=x_i) = F_1(X=x_i)\, F_2(X=x_i) = (1, 1, 0, 0) \circ (1, 0, 1, 0) = (1, 0, 0, 0). \tag{35}$$
From Table 12, we see how F 3 X = x i selects the interaction between “right-handed” and “blue eyes”. This interaction singles out the X = x 1 statement in the state space and, consequently, the Q 1 joint probability. Keeping our terminology simple, this interaction vector is also called a constraint function vector.
There are now three constraint function vectors
$$F_1(X=x_i) = (1, 1, 0, 0), \qquad F_2(X=x_i) = (1, 0, 1, 0), \qquad F_3(X=x_i) = (1, 0, 0, 0), \tag{36}$$
which can be combined to form the constraint function matrix
$$M = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \tag{37}$$
The constraint function matrix has dimensions m × n . As in Section 4, the expectation value of the interaction F 3 is
$$\langle F_3 \rangle = \sum_{i=1}^{n} F_3(X=x_i)\, Q_i = 1\, Q_1 + 0\, Q_2 + 0\, Q_3 + 0\, Q_4 = Q_1. \tag{38}$$
The three expectation values are combined to form the constraint function average vector
$$\big(\langle F_1 \rangle,\; \langle F_2 \rangle,\; \langle F_3 \rangle\big) = \big(3/4,\; 1/3,\; Q_1\big). \tag{39}$$
The constraint function average vector F 1 , F 2 , F 3 is related to the contravariant coordinates η 1 , η 2 , η 3 in information geometry in Section 6.
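In matrix form, the constraint function averages are simply the product of the constraint function matrix with the probability vector; a small sketch in our notation:

    M = {{1, 1, 0, 0}, {1, 0, 1, 0}, {1, 0, 0, 0}};        (* constraint function matrix (37) *)
    jointQ[q1_] := {q1, 3/4 - q1, 1/3 - q1, q1 - 1/12};
    Simplify[M . jointQ[q1]]                                (* -> {3/4, 1/3, q1}, i.e. (39)    *)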
In an under-determined problem, the number of constraints (primary and interaction) is $m < n - 1$. In our case $m = 3$; therefore, combined with the normalization of the probability distribution, we have a linear system of four equations with four unknowns. However, in this paper, we take a general approach as if we had an under-determined system with $m < n - 1$.
Returning to our kangaroo problem, from the MEP perspective, we will obtain four models M MEP , k defined by their constraint function averages F 1 , F 2 , F 3 . The set-up of the problem fixes values of F 1 and F 2 , whereas the third value, F 3 , is taken as the Q 1 -s from the M VP , k model solutions, as shown in Table 11.

5.2. The Maximum Entropy Principle

The MEP involves a constrained optimization problem utilizing the method of Lagrange multipliers. According to Jaynes, the MEP provides the most conservative, non-committal distribution where the missing information is as ‘spread-out’ as possible, yet which accords with no other constraints than those explicitly taken into account.
The MEP solution in its canonical form is ([4], p. 50)
$$Q_i = \frac{\exp\!\Big(\sum_{j=1}^{m} \lambda_j F_j(X=x_i)\Big)}{Z(\lambda)}. \tag{40}$$
Here Q i is the probability for the joint statement X = x i . The F j X = x i is the j-th constraint function operator acting on the i-th joint statement. The λ j are the Lagrange multipliers, each corresponding to a constraint function. The summation is over all m constraints. The Z ( λ ) in the denominator normalizes the joint probabilities and is called the partition function
$$Z(\lambda) = \sum_{i=1}^{n} \exp\!\Big(\sum_{j=1}^{m} \lambda_j F_j(X=x_i)\Big). \tag{41}$$
For our kangaroo problem the MEP solution can be written as
$$Q_i = \frac{\exp\!\big(\lambda_1 F_1(X=x_i) + \lambda_2 F_2(X=x_i) + \lambda_3 F_3(X=x_i)\big)}{Z(\lambda_1, \lambda_2, \lambda_3)}, \tag{42}$$
with
$$Z(\lambda_1, \lambda_2, \lambda_3) = \sum_{i=1}^{n} \exp\!\big(\lambda_1 F_1(X=x_i) + \lambda_2 F_2(X=x_i) + \lambda_3 F_3(X=x_i)\big). \tag{43}$$
The arguments of the exponents can be written in vector-matrix notation, using the constraint function matrix (37)
$$(\lambda_1, \lambda_2, \lambda_3) \cdot \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} = \big(\lambda_1 + \lambda_2 + \lambda_3,\; \lambda_1,\; \lambda_2,\; 0\big). \tag{44}$$
The partition function then becomes
$$Z(\lambda_1, \lambda_2, \lambda_3) = \exp(\lambda_1 + \lambda_2 + \lambda_3) + \exp(\lambda_1) + \exp(\lambda_2) + 1. \tag{45}$$
The joint probabilities (42) are expressed in full as
$$Q_1 = \frac{e^{\lambda_1+\lambda_2+\lambda_3}}{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_1} + e^{\lambda_2} + 1}, \quad Q_2 = \frac{e^{\lambda_1}}{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_1} + e^{\lambda_2} + 1}, \quad Q_3 = \frac{e^{\lambda_2}}{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_1} + e^{\lambda_2} + 1}, \quad Q_4 = \frac{1}{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_1} + e^{\lambda_2} + 1}, \tag{46}$$
and the three Lagrange parameters ( λ 1 , λ 2 , λ 3 ) are the solutions of the three constraint equations
$$Q_1 + Q_2 = \frac{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_1}}{Z(\lambda_1,\lambda_2,\lambda_3)} = \langle F_1 \rangle, \qquad Q_1 + Q_3 = \frac{e^{\lambda_1+\lambda_2+\lambda_3} + e^{\lambda_2}}{Z(\lambda_1,\lambda_2,\lambda_3)} = \langle F_2 \rangle, \qquad Q_1 = \frac{e^{\lambda_1+\lambda_2+\lambda_3}}{Z(\lambda_1,\lambda_2,\lambda_3)} = \langle F_3 \rangle. \tag{47}$$
This is a non-linear problem in three unknowns. Solving the Lagrange parameters usually requires an advanced numerical approximation technique. The Legendre transform provides such a method, which is described in detail by Blower ([4], Ch. 24), and demonstrated in the code example in Figure 4. In some cases, the λ j can be obtained exactly, as we will see below.
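Figure 4 is not reproduced here. As a simpler, if less instructive, alternative to the Legendre-transform technique, the three constraint equations (47) can also be handed directly to FindRoot; the following sketch (our own construction) recovers the Lagrange multipliers listed in Table 13.

    Zfun[l1_, l2_, l3_] := Exp[l1 + l2 + l3] + Exp[l1] + Exp[l2] + 1;
    lagrange[q1_] := FindRoot[
       {(Exp[l1 + l2 + l3] + Exp[l1])/Zfun[l1, l2, l3] == 3/4,
        (Exp[l1 + l2 + l3] + Exp[l2])/Zfun[l1, l2, l3] == 1/3,
         Exp[l1 + l2 + l3]/Zfun[l1, l2, l3] == q1},
       {{l1, 1}, {l2, -1}, {l3, 0}}]
    lagrange[1/4]     (* M_MaxEnt : l1 ->  1.10, l2 -> -0.69, l3 ->  0.00 *)
    lagrange[7/24]    (* M_LeastSq: l1 ->  0.79, l2 -> -1.61, l3 ->  1.16 *)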
Our four models are distinguished only by their constraint function average, F 3 = Q 1 , in (39). The details are shown in Table 13.
The constraint function averages are shown in the second column. The three Lagrange parameters are shown in the third column. From this column one can learn that all three Lagrange parameters $\lambda_j$ vary, even when only the value of $\langle F_3 \rangle$ is varied. Substituting these $\lambda_1, \lambda_2, \lambda_3$ in (46), the probability distributions $Q_i$ of the last column are obtained. In our case, these MEP solutions are the same as those obtained by the variational principle methods in Table 6, but this need not be so in general. The Lagrange parameters $(\lambda_1, \lambda_2, \lambda_3)$ are related to the covariant coordinates $(\theta^1, \theta^2, \theta^3)$ of information geometry in Section 6.
Close inspection of Table 13 reveals that the Lagrange multiplier $\lambda_3 = 0$ for the $M_{\mathrm{MEP,MaxEnt}}$ solution. This is an important observation because it signals that the $F_3(X=x_i)$ constraint function is redundant and, consequently, can be removed. The solution for the joint probabilities $Q_i$ using only $F_1, F_2$ is identical to the one with $F_1, F_2, F_3$. Actually, we knew this already, as this was the basis of the solution in Table 3, but the MEP provides a systematic method for detecting redundancies ([6], p. 369; [8], p. 108).
The Lagrange parameters can be solved algebraically for the M MEP , MaxEnt and the M MEP , LeastSq models. Recall that the M VP , MaxEnt and M VP , LeastSq models gave exact solutions for the Q i , namely from substituting (8) and (5) in (2). From (46), we see that Z = 1 / Q 4 , therefore the value of the partition function is exactly known. Subsequently, the exp λ j can be solved algebraically from (46).
Since the M VP , k and the M MEP , k model results turn out to be identical, the distinction based on their solution method can now be dropped. For consistency, we keep the redundant F 3 X = x i constraint function in the M MaxEnt model.

6. Information Geometry

6.1. Coordinate Systems

In Information Geometry (IG), a discrete probability distribution $Q_i$ is represented by a point in a manifold $S$. A manifold of dimension $n$ is denoted by $S^n$; in our case $n = 4$. The probability distribution is parameterized by two dual coordinate systems, namely a covariant system denoted by superscripts $(\theta^0, \theta^1, \theta^2, \theta^3)$ and a contravariant system denoted by subscripts $(\eta_0, \eta_1, \eta_2, \eta_3)$. This notation corresponds to the work of Amari [9]. The book by Blower [8] is entirely devoted to IG, and in this section we follow his notation.
The contravariant coordinate system corresponds to the constraint function averages
$$(\eta_0, \eta_1, \eta_2, \eta_3) = \big(\langle F_0 \rangle, \langle F_1 \rangle, \langle F_2 \rangle, \langle F_3 \rangle\big), \tag{48}$$
whereas the covariant coordinates are the Lagrange multipliers
$$(\theta^0, \theta^1, \theta^2, \theta^3) = (\lambda_0, \lambda_1, \lambda_2, \lambda_3). \tag{49}$$
The normalization of the probability distribution is given by
$$\langle F_0 \rangle = \eta_0 = 1. \tag{50}$$
This definition yields for the first covariant coordinate
$$\lambda_0 = \theta^0 = 1 - \log Z, \tag{51}$$
where Z is the partition function (41). For example, the uniform distribution q in the covariant coordinate system is
$$(\lambda_0, \lambda_1, \lambda_2, \lambda_3) = (1 - \log 4,\; 0,\; 0,\; 0), \tag{52}$$
and in the contravariant coordinate system
$$\big(\langle F_0 \rangle, \langle F_1 \rangle, \langle F_2 \rangle, \langle F_3 \rangle\big) = (1,\; 1/2,\; 1/2,\; 1/4). \tag{53}$$
In IG, the normalization is always implicitly assumed; therefore the coordinates η 0 and θ 0 are never shown explicitly. In the remainder of this paper, only three coordinates are used, namely
$$\big(\langle F_1 \rangle, \langle F_2 \rangle, \langle F_3 \rangle\big) = (\eta_1, \eta_2, \eta_3), \tag{54}$$
and
$$(\lambda_1, \lambda_2, \lambda_3) = (\theta^1, \theta^2, \theta^3). \tag{55}$$

6.2. Tangent Space

All modeling takes place in a sub-manifold S m , which is tangent to the manifold S n . This is illustrated in Figure 5. In our kangaroo problem m = 3 .
Perhaps it is tempting to think of a probability distribution Q as a vector in S n , with a coordinate system along the axes as in Figure 6. However, this notion is conceptually wrong because the probability distribution is normalized by
$$Q_1 + Q_2 + Q_3 + Q_4 = 1, \tag{56}$$
and not as
$$Q_1^2 + Q_2^2 + Q_3^2 + Q_4^2 = 1.$$
We will return to the issue of normalization in Section 6.6.
The manifold has no familiar extrinsic set of coordinate axes by which all points can be referenced. All we have is this austere representation of points mapped to a coordinate system ([8], p.46). The tangent space is spanned by a set of m basis vectors. The natural basis vectors of the tangent space are
$$e_r = F_r(X=x_i) - \langle F_r \rangle, \tag{57}$$
where we recognize the constraint function vector F r X = x i and the corresponding constraint function average F r . Notice that the constraint function average F r is subtracted from every element of the constraint function vector F r X = x i . For the least squares model solution M LeastSq , the basis vectors are
$$e_1 = \begin{pmatrix} 1/4 \\ 1/4 \\ -3/4 \\ -3/4 \end{pmatrix}, \qquad e_2 = \begin{pmatrix} 2/3 \\ -1/3 \\ 2/3 \\ -1/3 \end{pmatrix}, \qquad e_3 = \begin{pmatrix} 17/24 \\ -7/24 \\ -7/24 \\ -7/24 \end{pmatrix}, \tag{58}$$
where we have used (36) and (39), and substituted the Q i using (5).
These basis vectors are not orthogonal. The angle ϕ between two vectors v and w is given by
$$\cos(\phi) = \frac{v \cdot w}{\lVert v \rVert\, \lVert w \rVert}. \tag{59}$$
This gives for the angles in degrees between $e_1$ and $e_2$, $e_1$ and $e_3$, and $e_2$ and $e_3$: $98.1^\circ$, $56.1^\circ$, and $59.0^\circ$, respectively. The basis vectors are also not normalized; their lengths are defined as $\lVert e_r \rVert$ and found to be $1.12$, $1.05$, and $0.87$, respectively. However, the $e_r$ of (57) are perpendicular to the probability distribution (6) from the model $M_{\mathrm{VP,LeastSq}}$
$$Q_i = (7/24,\; 11/24,\; 1/24,\; 5/24);$$
all mutual angles $\phi$ are $90.0^\circ$.
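These numbers are easily verified with plain Euclidean dot products, as in (59); a sketch in our notation:

    Qls  = {7/24, 11/24, 1/24, 5/24};                       (* M_LeastSq, eq. (6)      *)
    Fmat = {{1, 1, 0, 0}, {1, 0, 1, 0}, {1, 0, 0, 0}};
    e    = Table[Fmat[[r]] - Fmat[[r]] . Qls, {r, 3}];      (* basis vectors, eq. (57) *)
    deg[v_, w_] := VectorAngle[v, w]/Degree;
    N[{deg[e[[1]], e[[2]]], deg[e[[1]], e[[3]]], deg[e[[2]], e[[3]]]}]
                                                            (* -> {98.1, 56.1, 59.0}   *)
    N[Norm /@ e]                                            (* -> {1.12, 1.05, 0.87}   *)
    N[Table[deg[e[[r]], Qls], {r, 3}]]                      (* -> {90., 90., 90.}      *)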
Since the basis vectors e r do not form an orthogonal coordinate system, for an arbitrary vector there are two possible projections. Covariant coordinates are obtained by a projection parallel to the basis vectors, while contravariant coordinates are obtained by a perpendicular projection onto the basis vectors.

6.3. Metric Tensor

Each probability distribution p in the manifold S n has an associated metric tensor G ( p ) . The metric tensor is an additional structure that allows the definition of distances and angles in the manifold.
The metric tensor is a symmetric matrix, and it comes in covariant and contravariant forms which are each other’s matrix inverse. The contravariant metric tensor turns out to be the same as the variance-covariance matrix ([8], p. 50). In our notation the covariant form is g r c , and the contravariant form is g r c , where the superscripts and subscripts r and c refer to the matrix row and column index.
The elements of the contravariant metric tensor are defined as inner products
$$g_{rc} = \Big\langle F_r(X=x_i) - \langle F_r \rangle,\; F_c(X=x_i) - \langle F_c \rangle \Big\rangle = \sum_{i=1}^{n} \big(F_r(X=x_i) - \langle F_r \rangle\big)\big(F_c(X=x_i) - \langle F_c \rangle\big)\, Q_i = \sum_{i=1}^{n} F_r(X=x_i)\, F_c(X=x_i)\, Q_i - \langle F_r \rangle \langle F_c \rangle. \tag{60}$$
The sum is over all state space cells, whereas the r and c are fixed. Notice that this is the same computation as (22) for the covariance between two vectors.
In the locally flat tangent space S m , the two coordinate systems are non-orthogonal, and the metric tensor forms the local transformation between the two coordinate systems,
$$\frac{\partial \langle F_c \rangle}{\partial \lambda_r} = \frac{\partial \eta_c}{\partial \theta^r} = g_{rc}, \tag{61}$$
and its inverse
$$\frac{\partial \lambda_r}{\partial \langle F_c \rangle} = \frac{\partial \theta^r}{\partial \eta_c} = g^{rc}. \tag{62}$$
In Blower’s notation the contravariant F j and covariant λ j vector indices do not follow the common Einstein convention.
The metric tensor can be computed by
$$g_{rc} = \frac{\partial^2 \log Z}{\partial \lambda_r\, \partial \lambda_c}, \tag{63}$$
with Z the partition function (43)
$$Z(\lambda) = e^{\lambda_1 + \lambda_2 + \lambda_3} + e^{\lambda_1} + e^{\lambda_2} + 1. \tag{64}$$
The contravariant metric tensor for our kangaroo problem is most easily expressed in the covariant coordinates λ 1 , λ 2 , λ 3
$$G(\lambda) = \frac{1}{Z^2} \begin{pmatrix} e^{\lambda_1}(e^{\lambda_2}+1)(e^{\lambda_2+\lambda_3}+1) & e^{\lambda_1+\lambda_2}(e^{\lambda_3}-1) & e^{\lambda_1+\lambda_2+\lambda_3}(e^{\lambda_2}+1) \\ e^{\lambda_1+\lambda_2}(e^{\lambda_3}-1) & e^{\lambda_2}(e^{\lambda_1}+1)(e^{\lambda_1+\lambda_3}+1) & e^{\lambda_1+\lambda_2+\lambda_3}(e^{\lambda_1}+1) \\ e^{\lambda_1+\lambda_2+\lambda_3}(e^{\lambda_2}+1) & e^{\lambda_1+\lambda_2+\lambda_3}(e^{\lambda_1}+1) & e^{\lambda_1+\lambda_2+\lambda_3}(e^{\lambda_1}+e^{\lambda_2}+1) \end{pmatrix}. \tag{65}$$
The Wolfram Mathematica [5] code which yields this symbolic expression is surprisingly compact, as shown in Figure 7. This short piece of code demonstrates the indispensability of a good symbolic tool when doing IG.
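Figure 7 itself is not reproduced here; a sketch that should yield the same symbolic matrix (65), via the Hessian of log Z as in (63), is:

    Zsym = Exp[l1 + l2 + l3] + Exp[l1] + Exp[l2] + 1;        (* partition function (64) *)
    G = Simplify[D[Log[Zsym], {{l1, l2, l3}, 2}]];           (* metric tensor (63)      *)
    (* the exact M_LeastSq multipliers are Log[11/5], -Log[5], Log[35/11]: *)
    Simplify[G /. {l1 -> Log[11/5], l2 -> Log[1/5], l3 -> Log[35/11]}]
    (* -> {{3/16, 1/24, 7/96}, {1/24, 2/9, 7/36}, {7/96, 7/36, 119/576}}, cf. (66) *)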
Substituting the appropriate Lagrange parameters from Table 13, the metric tensor for the least squares model solution M LeastSq is
$$G_{\mathrm{LeastSq}} = \begin{pmatrix} 3/16 & 1/24 & 7/96 \\ 1/24 & 2/9 & 7/36 \\ 7/96 & 7/36 & 119/576 \end{pmatrix}, \tag{66}$$
and for the maximum entropy model $M_{\mathrm{MaxEnt}}$, we obtain
$$G_{\mathrm{MaxEnt}} = \begin{pmatrix} 3/16 & 0 & 1/16 \\ 0 & 2/9 & 1/6 \\ 1/16 & 1/6 & 3/16 \end{pmatrix}. \tag{67}$$
Here we can see that the upper-left 2 × 2 sub-matrices are identical to the variance-covariance matrix of (24). The extension to the 3 × 3 matrices is due to the added interactions F 3 X = x i .

6.4. Kullback–Leibler Divergence

The Kullback–Leibler divergence allows for the determination of the differences in information content between two probability distributions. The Kullback–Leibler divergence between two discrete probability distributions p and q is defined as
$$KL(p \,\|\, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}. \tag{68}$$
The divergence is not a distance because the expression is not symmetric in p and q. A common way to refer to Kullback–Leibler divergence (KL) is as the relative entropy of p with respect to q or the information gained from p over q.
For example, with p = ( 0 , 0 , 1 , 0 ) and q = ( 1 / 4 , 1 / 4 , 1 / 4 , 1 / 4 ) we have
$$KL(p \,\|\, q) = 0 \log\frac{0}{1/4} + 0 \log\frac{0}{1/4} + 1 \log\frac{1}{1/4} + 0 \log\frac{0}{1/4} = \log 4, \tag{69}$$
where we have used the limit expression (31) again. However, when we interchange p and q we obtain
$$KL(q \,\|\, p) = \tfrac{1}{4} \log\frac{1/4}{0} + \tfrac{1}{4} \log\frac{1/4}{0} + \tfrac{1}{4} \log\frac{1/4}{1} + \tfrac{1}{4} \log\frac{1/4}{0} = \infty. \tag{70}$$
Therefore, figuratively speaking, we have gained a finite amount of information when learning that we are certain, but we have lost an “infinite” amount when we lose our certainty. Learning and forgetting are asymmetric.
Therefore, the notion of the KL-divergence as a distance measure between distinct probability distributions is flawed. Rewriting (68) we obtain
$$KL(p \,\|\, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} = \sum_{i=1}^{n} p_i \log p_i - \sum_{i=1}^{n} p_i \log q_i = -\langle \log q \rangle_p - H(p), \tag{71}$$
where $H(p)$ is the entropy of $p$. The first term on the right is minus the expectation of $\log q$ with respect to $p$. When $q \neq p$, $KL(p \,\|\, q)$ and $-\langle \log q \rangle_p$ are strictly positive quantities.
The KL-divergence can be expressed in bits when (68) is multiplied by log 2 e 1.44 . Table 14 shows the values for our four models. As expected, the table is not symmetric.
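Table 14 can be reproduced in a few lines; the sketch below (our notation) uses the exactly known MaxEnt and LeastSq models and computes the other two numerically:

    cons = {q1 + q2 == 3/4, q1 + q3 == 1/3, q1 + q2 + q3 + q4 == 1,
            q1 > 0, q2 > 0, q3 > 0, q4 > 0};
    vars = {q1, q2, q3, q4};
    qMaxEnt  = {1/4, 1/2, 1/12, 1/6};
    qLeastSq = {7/24, 11/24, 1/24, 5/24};
    qMaxLog  = vars /. Last[NMaximize[{Total[Log[vars]], cons}, vars]];
    qMaxSqrt = vars /. Last[NMaximize[{Total[Sqrt[vars]], cons}, vars]];
    KLbits[p_, q_] := Total[MapThread[If[#1 == 0, 0, #1 Log2[#1/#2]] &, {p, q}]];
    models = {qMaxEnt, qLeastSq, qMaxLog, qMaxSqrt};
    MatrixForm[N[Outer[KLbits, models, models, 1]]]
    (* rows/columns ordered MaxEnt, LeastSq, MaxLog, MaxSqrt; cf. Table 14 *)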
When the distributions $p$ and $q = p + dp$ are infinitesimally close, writing
$$q_i = p_i + dp_i, \tag{72}$$
we have
$$\sum_{i=1}^{n} dp_i = 0. \tag{73}$$
Expanding the KL-divergence for small $dp$
$$KL(p \,\|\, q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} = -\sum_{i=1}^{n} p_i \log\frac{q_i}{p_i} = -\sum_{i=1}^{n} p_i \log\Big(1 + \frac{dp_i}{p_i}\Big) = -\sum_{i=1}^{n} dp_i + \sum_{i=1}^{n} \frac{dp_i^2}{2 p_i} - O(dp^3) \approx \frac{1}{2}\sum_{i=1}^{n} \frac{dp_i^2}{p_i}. \tag{74}$$
This expansion is a sum of squares, which is symmetric. Therefore, the KL-divergence is commutative for infinitesimal separations between p and q.
This property of the Kullback–Leibler divergence has an analogy in classical mechanics, namely that two infinitesimal rotations of a rigid body along different principal axes are commutative, while finite rotations are not.

6.5. Distances

What is the distance between two discrete probability distributions p and q in the manifold S n ? This is at the heart of Information Geometry. For a distance we need a curve connecting the two points. There are many possibilities. What would be the length of such curves? Which one is the shortest? The shortest of all possible curves is called a geodesic. Suppose that s is a curve connecting p and q, then any point t on the curve s is a probability distribution. Therefore, we have a continuum of probability distributions along s in the manifold S n .
For two close-by points p and q = p + d p , their distance is a function of the KL-divergence, namely ([8], pp. 77–78)
$$ds = \sqrt{2\, KL(p \,\|\, q)}. \tag{75}$$
The same distance is given by
$$ds = \sqrt{\sum_{r=1}^{m} \sum_{c=1}^{m} g_{rc}(p)\, d\lambda_r\, d\lambda_c}, \tag{76}$$
where the covariant coordinates λ and λ + d λ of p and q are used, and the metric tensor g r c ( p ) is evaluated as in (65). However, there is a subtle difference here, namely the KL-divergence in (75) is computed in the full manifold S n , whereas d s in (76) is computed in the tangent space S m , with m < n .
When the two distributions are finitely separated, as is the case for our models M k , the length of the curve s ( t ) is the integral from p to q of
$$L(s) = \int_p^q \lVert s'(t) \rVert\, dt, \tag{77}$$
where $s(t)$ is the curve in $S^n$ parameterized by the probability distribution $t$, and $s'(t)$ is its first derivative. The tangent sub-manifold $S^m(t)$ follows $t$ along $s(t)$ from $p$ to $q$, and the Lagrange parameters $\lambda(t)$ and the metric tensor $g_{rc}(t)$ vary with $t$. However, finding the distance $D = \min L(s)$ is an Euler–Lagrange variational problem beyond the scope of this paper [10].

6.6. Angular Distances

The distance between two probability distributions can also be found as the arc length of a great circle on a sphere in S n . This is known as the Bhattacharyya angle.
Substituting (74) in (75) we can write
$$ds^2 = \sum_{i=1}^{n} \frac{dp_i^2}{p_i} = \sum_{r=1}^{m} \sum_{c=1}^{m} g_{rc}(p)\, dp_r\, dp_c, \tag{78}$$
with a metric tensor
$$g_{rc}(p) = \begin{cases} 1/p_r & \text{for } r = c, \\ 0 & \text{otherwise}. \end{cases} \tag{79}$$
Using the transformation
$$\psi_i = \sqrt{p_i}, \tag{80}$$
we define $\psi$ as a point on the positive orthant of the unit sphere with
$$\sum_{i=1}^{n} \psi_i^2 = \sum_{i=1}^{n} p_i = 1. \tag{81}$$
This effectively restricts $\psi$ to a sub-manifold $S^{n-1}$. The geometry is illustrated by Figure 8. In the $\psi$-coordinate system, the infinitesimal distance becomes
$$ds^2 = \sum_{i=1}^{n} \frac{dp_i^2}{p_i} = \sum_{i=1}^{n} \frac{(2 \psi_i\, d\psi_i)^2}{\psi_i^2} = 4 \sum_{i=1}^{n} d\psi_i^2, \tag{82}$$
or
$$ds = 2\, \lVert d\psi \rVert. \tag{83}$$
Notice that in this coordinate system the metric tensor is the Euclidean metric tensor
$$g_{rc}(\psi) = \begin{cases} 1 & \text{for } r = c, \\ 0 & \text{otherwise}. \end{cases} \tag{84}$$
With this transformation the probability distributions become points on a hypersphere with a unit radius in ( n 1 ) dimensions. However, it is well known that geodesics on a sphere are great circles. Therefore, the distance can be obtained by the path integral (77) along a great circle connecting the two points. The arc length between two points is the subtended angle θ between two points ψ 1 and ψ 2 on the unit hypersphere
$$\theta = \arccos\big(\psi_1 \cdot \psi_2\big) = \arccos\Big(\sum_{i=1}^{n} \psi_{1,i}\, \psi_{2,i}\Big) = \arccos\Big(\sum_{i=1}^{n} \sqrt{p_i\, q_i}\Big). \tag{85}$$
This remarkable result is the Bhattacharyya angle between two probability distributions [11]. The distance D between p and q is twice the arc length from (83)
$$D(p, q) = 2\theta = 2 \arccos\Big(\sum_{i=1}^{n} \sqrt{p_i\, q_i}\Big). \tag{86}$$
The units of D are radians. The maximum distance of π radians is achieved between two orthogonal distributions.
With this result we can compute the symmetric distance table for our four Kangaroo models, shown in Table 15; the numerical values are converted from radians to degrees. From this table we see that the largest distance is between the models M LeastSq and M MaxLog . This observation corresponds with these models having the biggest difference in their correlation coefficients ρ F 1 , F 2 in Table 11.
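Equation (86) makes the entries of Table 15 a one-line computation; for example, in our notation:

    Dbhatt[p_, q_] := 2 ArcCos[Total[Sqrt[p q]]]/Degree;    (* distance (86), in degrees *)
    qMaxEnt  = {1/4, 1/2, 1/12, 1/6};
    qLeastSq = {7/24, 11/24, 1/24, 5/24};
    N[Dbhatt[qMaxEnt, qLeastSq]]                            (* -> about 12.5, cf. Table 15 *)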
Interestingly, when we define lower and upper bounds
$$KL_{\min} = \min\big\{KL(p \,\|\, q),\; KL(q \,\|\, p)\big\}, \qquad KL_{\max} = \max\big\{KL(p \,\|\, q),\; KL(q \,\|\, p)\big\}, \tag{87}$$
all the distances from Table 15 have values
$$\sqrt{2\, KL_{\min}} < D(p, q) < \sqrt{2\, KL_{\max}}. \tag{88}$$
Although we have no proof, this observation suggests that the two forms of the KL-divergence may act as lower and upper limits for the true distance D ( p , q ) .

6.7. Geodesics

The arc of the great circle connecting the two points can be found as follows [12]. Let $v_1$ and $v_2$ be two points on the $(n-1)$-dimensional hypersphere, then
$$w = v_2 - (v_2 \cdot v_1)\, v_1, \tag{89}$$
$$u = \frac{w}{\lVert w \rVert}. \tag{90}$$
Then
$$\alpha(\tau) = \cos(\tau)\, v_1 + \sin(\tau)\, u \tag{91}$$
traces out a great circle through $v_1$ and $v_2$. It starts at $\alpha(\tau) = v_1$ when $\tau = 0$, it reaches $\alpha(\tau) = v_2$ at $\tau = \arccos(v_2 \cdot v_1)$, and returns to $v_1$ when $\tau = 2\pi$. Here we recognize the Bhattacharyya angle again.
When $v_1 = \psi_1$ and $v_2 = \psi_2$ represent two probability distributions, they must remain on the positive orthant of the hypersphere. For $0 \le \tau \le \arccos(\psi_2 \cdot \psi_1)$,
$$t = \alpha^2(\tau) \tag{92}$$
is a probability distribution in S n on the geodesic connecting ψ 1 and ψ 2 .
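A sketch of this construction for the pair (M_MaxEnt, M_LeastSq), in our notation:

    psi1 = Sqrt[{1/4, 1/2, 1/12, 1/6}];            (* M_MaxEnt  on the unit sphere *)
    psi2 = Sqrt[{7/24, 11/24, 1/24, 5/24}];        (* M_LeastSq on the unit sphere *)
    w = psi2 - (psi2 . psi1) psi1;                 (* eq. (89)                     *)
    u = w/Norm[w];                                 (* eq. (90)                     *)
    alpha[tau_] := Cos[tau] psi1 + Sin[tau] u;     (* eq. (91)                     *)
    t[tau_] := alpha[tau]^2;                       (* eq. (92)                     *)
    N[Total[t[0.1]]]                               (* -> 1., a proper distribution *)
    Chop[N[t[ArcCos[psi1 . psi2]] - psi2^2]]       (* -> {0,0,0,0}: endpoint is M_LeastSq *)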
Our under-determined problem is parametrized by a single variable, namely 1 / 12 Q 1 1 / 3 from (3), which implies that there is only one dimension involved. Therefore it seemed reasonable to surmise that varying Q 1 traces out probability distributions t along the shortest distance between the various models, but this turned out to be incorrect. The distributions t on the geodesic s ( t ) connecting, for example, M LeastSq to M MaxEnt , do not comply with the constraint function average vector (39)
$$\big(\langle F_1 \rangle, \langle F_2 \rangle, \langle F_3 \rangle\big) = \big(3/4,\; 1/3,\; Q_1\big), \tag{93}$$
except for the endpoints.

6.8. Distances Revisited

Our knowledge of the geodesic s ( t ) allows us to verify (75) with (76). The arc length of the geodesic between p and q is
$$s = \int_p^q \frac{ds(t)}{dt}\, dt = \int_p^q \sqrt{\sum_{r=1}^{m} \sum_{c=1}^{m} g_{rc}(t)\, \frac{d\lambda_r(t)}{dt}\, \frac{d\lambda_c(t)}{dt}}\; dt, \tag{94}$$
where we have substituted (76). Notice that the metric tensor as well as the covariant coordinates depends on t. This integral can be approximated by a sum of many small steps in t ([8], p.78).
By taking K small segments, the distance s is approximated by
$$s = \sum_{k=0}^{K-1} \left[ \sum_{r=1}^{m} \sum_{c=1}^{m} \big(\lambda(t_k) - \lambda(t_{k+1})\big)_r\; g_{rc}\big(\lambda(t_k)\big)\; \big(\lambda(t_k) - \lambda(t_{k+1})\big)_c \right]^{1/2}. \tag{95}$$
Here $k = 0$ corresponds to the probability distribution $t_0 = p$ and $k = K$ is the distribution $t_K = q$. The intermediate points $t_k$ are obtained by dividing the arc $0 \le \tau \le \arccos(\psi_1 \cdot \psi_2)$ of the hypersphere into $K$ equal angular segments. The corresponding probability distributions are $t_k = \alpha^2(\tau_k)$, using (92).
For each t k in (95), the constraint function averages F 1 , F 2 , F 3 k are obtained through the multiplication by the constraint function vectors (36). The corresponding covariant coordinates λ ( t k ) are computed by solving the set of equations in (47), as illustrated by Figure 4. Finally, the metric tensor g r c λ ( t k ) is obtained through substitution of λ ( t k ) in (65). By taking K = 128 segments and performing the computation of (95) we have confirmed all the numerical values in Table 15. This confirms the equivalence of (75) and (76).
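The sketch below illustrates this computation for the pair (M_MaxEnt, M_LeastSq); it is our own construction, with FindRoot standing in for the Legendre-transform solver of Figure 4, and should recover the 12.5 degrees of Table 15.

    p = {1/4, 1/2, 1/12, 1/6};  q = {7/24, 11/24, 1/24, 5/24};
    psi1 = Sqrt[p]; psi2 = Sqrt[q];
    u = #/Norm[#] &[psi2 - (psi2 . psi1) psi1];
    tk[tau_] := (Cos[tau] psi1 + Sin[tau] u)^2;              (* points on the geodesic (92) *)
    Zsym = Exp[l1 + l2 + l3] + Exp[l1] + Exp[l2] + 1;
    Gsym = D[Log[Zsym], {{l1, l2, l3}, 2}];                  (* metric tensor (63)/(65)     *)
    Mcf = {{1, 1, 0, 0}, {1, 0, 1, 0}, {1, 0, 0, 0}};        (* constraint matrix (37)      *)
    lam[t_] := {l1, l2, l3} /. FindRoot[
        Thread[{(Exp[l1 + l2 + l3] + Exp[l1])/Zsym,
                (Exp[l1 + l2 + l3] + Exp[l2])/Zsym,
                 Exp[l1 + l2 + l3]/Zsym} == Mcf . t],
        {{l1, 1}, {l2, -1}, {l3, 0}}];
    nseg = 128;
    lams = lam /@ (tk /@ Subdivide[0, N[ArcCos[psi1 . psi2]], nseg]);
    dist = Sum[With[{dl = lams[[k + 1]] - lams[[k]]},
         Sqrt[dl . (Gsym /. Thread[{l1, l2, l3} -> lams[[k]]]) . dl]], {k, nseg}];
    dist/Degree       (* -> about 12.5, matching Table 15 *)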

7. Conclusions

The Gull–Skilling kangaroo problem provides a useful setting for illustrating the solution of under-determined problems in probability. The Variational Principle, in conjunction with a variational function, effectively creates enough missing information for a complete solution, but not necessarily the minimum amount. In this paper four different Variational Principle solutions are shown, only one of which introduces the minimum amount, namely when the variational function is the Shannon–Jaynes entropy function.
The Maximum Entropy Principle is an alternative method for solving under-determined problems that avoids any implicit introduction of extra information not present in the original problem. In our examples, such extra information manifests itself as added correlations in the solutions.
The Kullback–Leibler divergence allows for the determination of the differences in information content between two probability distributions, but it cannot be used as a distance measure. It is symmetrical for infinitesimal separations. We point out an analogy with infinitesimal rigid body rotations.
Through the lens of Information Geometry, the actual geometric distance between two probability distributions along a geodesic path can also be expressed as twice the Bhattacharyya angle on a hypersphere. In this paper, we illustrate the equivalence of these two geometrical concepts.
We also find that the mutual differences in distance between any two models are directly reflected in the differences of their correlation coefficients.
Our understanding of the kangaroo problem and its implications has been particularly facilitated by the symbolic programming capabilities of Wolfram Mathematica.

Author Contributions

The authors R.B. and B.J.S. contributed equally to the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This paper was written to honor our late friend David Blower. The reader may benefit from his books, as we have. We acknowledge the comments of John Skilling, who pointed out the Bhattacharyya angle to us. Further we thank Ann Stokes, Ali Mohammad-Djafari and two anonymous referees for supporting comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sivia, D.S.; Skilling, J. Data Analysis, 2nd ed.; Oxford University Press: Oxford, UK, 2006; pp. 111–113. [Google Scholar]
  2. Gull, S.F.; Skilling, J. Maximum entropy method in image processing. IEE Proc. 1984, 131, 646–659. [Google Scholar] [CrossRef]
  3. Jaynes, E.T. Monkeys, kangaroos and N. In Maximum Entropy and Bayesian Methods in Applied Statistics; Justice, J. H., Ed.; Cambridge University Press: Calgary, AB, Canada, 1984; pp. 27–58. [Google Scholar]
  4. Blower, D.J. Information Processing, Volume II, The Maximum Entropy Principle; Third Millennium Inferencing: Pensacola, FL, USA, 2013. [Google Scholar]
  5. Wolfram Mathematica. Available online: www.wolfram.com (accessed on 7 December 2022).
  6. Jaynes, E.T. Probability Theory: The Logic of Science; Bretthorst, G.L., Ed.; Cambridge University Press: New York, NY, USA, 2003. [Google Scholar]
  7. Buck, B; Macaulay, V. A. Maximum Entropy in Action; Oxford University Press: Oxford, UK, 1991. [Google Scholar]
  8. Blower, D.J. Information Processing, Volume III, Introduction to Information Geometry; Third Millennium Inferencing: Pensacola, FL, USA, 2016. [Google Scholar]
  9. Amari, S.; Nagaoka, H. Methods of Information Geometry; Originally Published in Japanese by Iwanami Shoten Publishers, Tokyo, Ed.; Translated by D. Harada; Oxford University Press: Oxford, UK, 1993. [Google Scholar]
  10. Mathews, J.; Walker, R.L. Mathematical Methods of Physics, 2nd ed.; Addison-Wesley Publ.: Menlo Park, CA, USA, 1973; pp. 322–344. [Google Scholar]
  11. Bhattacharyya, A. On a Measure of Divergence between Two Multinomial Populations. Sankhyā 1946, 7, 401–406. [Google Scholar]
  12. Mathematics Stack Exchange. Available online: https://math.stackexchange.com/questions/1883904/a-time-parameterization-of-geodesics-on-the-sphere (accessed on 15 September 2022).
Figure 1. Wolfram Mathematica code for solving the under-determined problem (2).
Figure 2. Parallel-axis plot of the Q 1 , Q 2 , Q 3 , and Q 4 , for Q 1 between 1 / 12 (red) and 1 / 3 (purple) in equidistant steps of 1 / 16 . For clarity, the Q 1 axis is repeated on the right.
Figure 3. Parallel axis plot of the Variational Principle solutions: M VP , MaxEnt is blue, M VP , LeastSq is green, M VP , MaxLogs is orange, and M VP , MaxSqrt is red.
Figure 4. Wolfram Mathematica code for finding the Lagrange parameters (47) using the Legendre transform as a function of Q 1 .
Figure 5. The sub-manifold S m (blue) is tangent to the manifold S n . The red line is a meridian of longitude and the blue line is a parallel of latitude through the point of tangency.
Figure 6. Incorrect view of the probability distribution as a vector (green) to the point of tangency in S n , with a coordinate system along the axes.
Figure 7. Wolfram Mathematica code for calculating the metric tensor (65).
Figure 8. Positive orthant S n 1 . In the ψ -coordinate system, the ψ i are orthonormal coordinates.
Table 1. Probability table: version 1.
Probability Table
              | Blue eyes | Green eyes |
Right-handed  |           |            | 3/4
Left-handed   |           |            |
              | 1/3       |            | 1
Table 2. Probability table: version 2.
Probability Table
              | Blue eyes | Green eyes |
Right-handed  | Q_1       | Q_2        | 3/4
Left-handed   | Q_3       | Q_4        | 1/4
              | 1/3       | 2/3        | 1
Table 3. Probability table, version 3.
Probability Table
              | Blue eyes | Green eyes |
Right-handed  | 1/4       | 1/2        | 3/4
Left-handed   | 1/12      | 1/6        | 1/4
              | 1/3       | 2/3        | 1
Table 4. Probability table, version 4.
Probability Table
              | Blue eyes          | Green eyes  |
Right-handed  | 1/12 ≤ Q_1 ≤ 1/3   | 3/4 − Q_1   | 3/4
Left-handed   | 1/3 − Q_1          | Q_1 − 1/12  | 1/4
              | 1/3                | 2/3         | 1
Table 5. Sivia’s four variational functions: entropy, sum of squares, sum of logarithms, sum of square roots.
Variational Functions
Function              |
Maximum entropy       | −Σ_{i=1}^{n} Q_i log Q_i
Least squares         | Σ_{i=1}^{n} Q_i^2
Maximum logarithms    | Σ_{i=1}^{n} log Q_i
Maximum square roots  | Σ_{i=1}^{n} √Q_i
Table 6. The Variational Principle solutions.
Variational Functions
Model         | Function                  | Q_i
M_VP,MaxEnt   | −Σ_{i=1}^{n} Q_i log Q_i  | (0.25, 0.50, 0.08, 0.17)
M_VP,LeastSq  | Σ_{i=1}^{n} Q_i^2         | (0.29, 0.46, 0.04, 0.21)
M_VP,MaxLog   | Σ_{i=1}^{n} log Q_i       | (0.23, 0.52, 0.11, 0.14)
M_VP,MaxSqrt  | Σ_{i=1}^{n} √Q_i          | (0.24, 0.51, 0.10, 0.15)
Table 7. The n cells of the state space uniquely number the joint statements.
State Space Table
              | Blue eyes | Green eyes |
Right-handed  | X = x_1   | X = x_2    | 3/4
Left-handed   | X = x_3   | X = x_4    | 1/4
              | 1/3       | 2/3        | 1
Table 8. The constraint function F X = x i is a function defined on the space of joint statements.
State Space Table
              | Blue eyes   | Green eyes  |
Right-handed  | F(X = x_1)  | F(X = x_2)  | 3/4
Left-handed   | F(X = x_3)  | F(X = x_4)  | 1/4
              | 1/3         | 2/3         | 1
Table 9. The constraint function F 1 for the constraint “ 3 / 4 of the kangaroos are right-handed.”
State Space Table
              | Blue eyes         | Green eyes        |
Right-handed  | F_1(X = x_1) = 1  | F_1(X = x_2) = 1  | 3/4
Left-handed   | F_1(X = x_3) = 0  | F_1(X = x_4) = 0  | 1/4
              | 1/3               | 2/3               | 1
Table 10. The constraint function vector for the second constraint ( 1 / 3 of the kangaroos have blue eyes).
State Space Table
              | Blue eyes         | Green eyes        |
Right-handed  | F_2(X = x_1) = 1  | F_2(X = x_2) = 0  | 3/4
Left-handed   | F_2(X = x_3) = 1  | F_2(X = x_4) = 0  | 1/4
              | 1/3               | 2/3               | 1
Table 11. The numerical details for the variational principle solutions.
Variational Functions
Model         | Function                  | Q_i                      | ρ(F_1, F_2) | H(Q) (bits)
M_VP,MaxEnt   | −Σ_{i=1}^{n} Q_i log Q_i  | (0.25, 0.50, 0.08, 0.17) |  0.00       | 1.730
M_VP,LeastSq  | Σ_{i=1}^{n} Q_i^2         | (0.29, 0.46, 0.04, 0.21) |  0.20       | 1.697
M_VP,MaxLog   | Σ_{i=1}^{n} log Q_i       | (0.23, 0.52, 0.11, 0.14) | −0.11       | 1.721
M_VP,MaxSqrt  | Σ_{i=1}^{n} √Q_i          | (0.24, 0.51, 0.10, 0.15) | −0.07       | 1.727
Table 12. F 3 X = x i selects the interaction between “right-handed” and “blue eyes”.
State Space Table
              | Blue eyes         | Green eyes        |
Right-handed  | F_3(X = x_1) = 1  | F_3(X = x_2) = 0  | 3/4
Left-handed   | F_3(X = x_3) = 0  | F_3(X = x_4) = 0  | 1/4
              | 1/3               | 2/3               | 1
Table 13. MEP-solution of the kangaroo problem.
Model          | <F_j>              | λ_j                   | Q_i
M_MEP,MaxEnt   | (0.75, 0.33, 0.25) | (1.10, −0.69, 0.00)   | (0.25, 0.50, 0.08, 0.17)
M_MEP,LeastSq  | (0.75, 0.33, 0.29) | (0.79, −1.61, 1.16)   | (0.29, 0.46, 0.04, 0.21)
M_MEP,MaxLog   | (0.75, 0.33, 0.23) | (1.29, −0.31, −0.53)  | (0.23, 0.52, 0.11, 0.14)
M_MEP,MaxSqrt  | (0.75, 0.33, 0.24) | (1.21, −0.46, −0.31)  | (0.24, 0.51, 0.10, 0.15)
Table 14. The Kullback–Leibler divergence K L p q (bits) between the models M k , where p and q are the models in the rows and columns, respectively.
            | M_MaxEnt | M_LeastSq | M_MaxLog | M_MaxSqrt
M_MaxEnt    | 0        | 0.0368    | 0.0085   | 0.0030
M_LeastSq   | 0.0327   | 0         | 0.0729   | 0.0547
M_MaxLog    | 0.0087   | 0.0834    | 0        | 0.0014
M_MaxSqrt   | 0.0031   | 0.0623    | 0.0014   | 0
Table 15. The distance D (in degrees) between the models M_k.
M_MaxEnt  – M_LeastSq  | 12.5
M_MaxEnt  – M_MaxLog   |  6.3
M_MaxEnt  – M_MaxSqrt  |  3.7
M_LeastSq – M_MaxLog   | 18.8
M_LeastSq – M_MaxSqrt  | 16.3
M_MaxLog  – M_MaxSqrt  |  2.5
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
