Double Contingency of Communications in Bayesian Learning

Abstract: In previous work, we described the geometry of Bayesian learning on a manifold. In this paper, inspired by the notion of modified double contingency of communications from the sociologist Niklas Luhmann, we take two manifolds on an equal footing and a potential function on their product to set up mutual Bayesian learning. In particular, given a parametric statistical model, we consider mutual learning between two copies of the parameter space. Here, we associate the potential with the relative entropy (i.e., the Kullback–Leibler divergence). Although the mutual learning forgets everything about the model except the relative entropy, it still substitutes for the usual Bayesian estimation of the parameter in a certain case. We propose it as a globalization of information geometry.


Introduction
This is a sequel to the author's research [1] on the geometry of Bayesian learning. We introduce mutual Bayesian learning by taking two manifolds, each of which is the parameter space of a family of density functions on the other. This setting has the following background in sociology that seems more ideological than practical.
Talcott Parsons [2] introduced the notion of double contingency in sociology. Here, the contingency is that no event is necessary and no event is impossible. A possible understanding of this definition appeals to probability theory. Specifically, even an event with probability P = 1 does not always occur, and even that with P = 0 sometimes occurs, as a non-empty null set appears, at least conceptually. We consider the contingency as the subjective probabilistic nature of society. In fact, updating the conceptual subjective probability according to Bayes' rule should be a response to the conventional contingency that the prior probability is not a suitable predictor in reality. However, the double contingency is not straightforward, as it concerns mutually dependent social actions. In this article, we describe the double contingency by means of Bayesian learning. In our description, when one learns from another, the opposite learning also proceeds. This implies that, in contrast to sequential games such as chess, the actions in a double contingency have to be selected at once. Niklas Luhmann [3] leveraged this simultaneity to regard people not as individuals but as a single agent that he called a system. This further enabled him to apply the double contingency to any communications between systems. We introduce a function λ on the product of two manifolds to understand his systems theory.
From a practical perspective, we consider a family $\{h_x : W \to \mathbb{R}_{>0}\}_{x \in X}$ of probability densities on a manifold W and regard the parameter space X as a manifold. The product X × X carries the function $\phi : X \times X \to \mathbb{R}_{\ge 0}$ induced from the relative entropy. Recall that the information geometry [4] is a differential geometry on the diagonal set ∆ ⊂ X × X, which deals with the 3-jet of ϕ at ∆. The author [5] began exploring the global geometry of (X × X, ϕ). Now we take exp(−ϕ) as the above function λ, and show that the mutual Bayesian learning between two copies of X substitutes for the original Bayesian estimation on W in a certain case. We notice that the global geometry of ϕ, as well as the information geometry, forgets the original problem on W, and addresses a related problem on X. In this regard, our mutual Bayesian learning is a globalization of the information geometry.

Geometric Bayesian Learning
We work in the $C^\infty$-smooth category. Take a possibly non-compact and possibly disconnected manifold X equipped with a volume form $\mathrm{dvol}_X$. Note that a discrete set is a 0-dimensional manifold on which a positive function is a volume form. Suppose that each point x of the manifold X presents a possible action of a person. A positive function $f : X \to \mathbb{R}_{>0}$ on X is called a density. If its integral $|f| := \int_X f\,\mathrm{dvol}_X$ is finite, it defines the probability $(f/|f|)\,\mathrm{dvol}_X$ on X. Suppose that the selection of an action x is weighted by a density $f_0$ on X. In our story, the person believes that a density $\rho_x : Y \to \mathbb{R}_{>0}$ on another manifold $(Y, \mathrm{dvol}_Y)$ depends on his action x. That is why the person perceives a given point $y_0 \in Y$ by multiplying the density $f_0$ by the function
$$\rho_{\cdot}(y_0) : X \to \mathbb{R}_{>0}, \qquad x \mapsto \rho_x(y_0),$$
which is called the likelihood of the datum $y_0 \in Y$. The perception updates the prior density $f_0$ to the posterior density
$$f_1(x) = \rho_x(y_0)\, f_0(x).$$
The probability $(f_1/|f_1|)\,\mathrm{dvol}_X$ is the Bayesian posterior probability provided that $(f_0/|f_0|)\,\mathrm{dvol}_X$ is the prior probability. The only change from the description in [1] is the aim of the learning, i.e., prediction is replaced with action. Although the word action has an active meaning, an activity consisting of countless actions would be a chain of automatic adaptations to the environment.
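The update described above can be sketched on a 0-dimensional manifold, where integrals are finite sums. This is only an illustration: the action names, the outcome names, and the numbers in `rho` are hypothetical and do not come from the paper.

```python
def total_mass(f, vol):
    """|f|: integral of the density f against the volume form (a finite sum here)."""
    return sum(f[x] * vol[x] for x in f)


def perceive(prior, rho, y0):
    """Posterior density f1(x) = rho_x(y0) * f0(x)."""
    return {x: rho[x][y0] * prior[x] for x in prior}


def probability(f, vol):
    """The probability (f / |f|) dvol, defined when |f| is finite."""
    mass = total_mass(f, vol)
    return {x: f[x] * vol[x] / mass for x in f}


# Hypothetical toy model: two actions (points of X), two outcomes (points of Y).
vol_X = {"walk": 1.0, "drive": 1.0}
f0 = {"walk": 1.0, "drive": 1.0}  # flat prior density
rho = {  # rho[x][y]: the density on Y that the person believes depends on action x
    "walk": {"sun": 0.8, "rain": 0.2},
    "drive": {"sun": 0.4, "rain": 0.6},
}

f1 = perceive(f0, rho, "rain")   # posterior density after the datum y0 = "rain"
p1 = probability(f1, vol_X)      # normalized posterior probability
print(p1)
```

Only the density is updated; the normalization is applied at the end, which is why a formal prior with infinite mass can still be used as long as the posterior has finite mass.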

Mutual Learning
It is natural to symmetrize the above setting by exchanging the roles of X and Y. Specifically, we further suppose that a point y of the second manifold Y parameterizes a density $\rho_y : X \to \mathbb{R}_{>0}$ on the first manifold X, and the perception of a datum $x_0 \in X$ by the second person updates a prior density $g_0 : Y \to \mathbb{R}_{>0}$ on the second manifold to the posterior density $g_1(y) = \rho_y(x_0)\, g_0(y)$. This models the double contingency of Parsons [2]. We further modify it as follows. Fix volume forms $\mathrm{dvol}_X$, $\mathrm{dvol}_Y$, and $\mathrm{dvol}_{X \times Y}$ on X, Y, and X × Y, respectively. Take densities $f_0 : X \to \mathbb{R}_{>0}$, $g_0 : Y \to \mathbb{R}_{>0}$, and $\lambda : X \times Y \to \mathbb{R}_{>0}$. Suppose that the prior densities $f_0$ and $g_0$ change, respectively, to the posterior densities
$$f_1 = \lambda(\cdot, y_0)\, f_0 : x \mapsto \lambda(x, y_0)\, f_0(x) \quad \text{and} \quad g_1 = \lambda(x_0, \cdot)\, g_0 : y \mapsto \lambda(x_0, y)\, g_0(y).$$
This models the double contingency of Luhmann [3]. We say that $f_0$ is coupled with $g_0$ in the mutual learning through Luhmann's potential λ on the product X × Y. Since the potential λ is also a density, it can be coupled with a density $\sigma_0$ on another manifold Z. Specifically, if there is a datum $((x, y)_0, z_0)$ and a density $\tau_0 : (X \times Y) \times Z \to \mathbb{R}_{>0}$, the pair of two persons can change the tendency of its action selection. This mathematics enables us to consider the double contingency not only between persons but also between systems. Here we suppose that the points $(x, y)_0$ and $(x_0, y_0)$ are given as the same point. We emphasize that what we are discussing is not how the datum appears objectively, but how we perceive it or how we learn from it subjectively. We discuss in Section 4 the discordance between $(x, y)_0$ and $(x_0, y_0)$ to understand a proposition in Luhmann's systems theory saying that no system is a subsystem.
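As a minimal sketch of the coupled update, the following takes finite sets for X and Y and a hypothetical potential `lam`; the function names and the particular choice of `lam` are assumptions made only for illustration. Both densities are updated from the same pair of data at once, which reflects the simultaneity stressed in the Introduction.

```python
def mutual_step(f, g, lam, x0, y0):
    """One simultaneous update: f1 = lam(., y0) * f0 and g1 = lam(x0, .) * g0."""
    f1 = {x: lam[(x, y0)] * f[x] for x in f}
    g1 = {y: lam[(x0, y)] * g[y] for y in g}
    return f1, g1


def mutual_learning(f, g, lam, data):
    """Iterate the mutual step over a sequence of data (x_k, y_k)."""
    for x_k, y_k in data:
        f, g = mutual_step(f, g, lam, x_k, y_k)
    return f, g


# Hypothetical finite manifolds and an arbitrary positive potential on X x Y.
X = [0, 1]
Y = [0, 1, 2]
lam = {(x, y): 1.0 / (1 + abs(x - y)) for x in X for y in Y}

f0 = {x: 1.0 for x in X}
g0 = {y: 1.0 for y in Y}

f1, g1 = mutual_step(f0, g0, lam, x0=1, y0=2)
f2, g2 = mutual_learning(f0, g0, lam, [(1, 2), (0, 0)])
```

Note that neither update reads the other person's density: `f1` depends only on the datum `y0` and `g1` only on `x0`, so the two updates commute and can be performed in parallel.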

Relative Entropy
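Shannon [6] introduced the notion of entropy in information theory. As for continuous distributions, Jaynes [7] pointed out that the notion of relative entropy
$$D(f_1 \| f_0) = \int_X \frac{f_1}{|f_1|} \log\left(\frac{f_1/|f_1|}{f_0/|f_0|}\right) \mathrm{dvol}_X$$
is rather foundational to the notion of entropy
$$H(f_1) = -\int_X \frac{f_1}{|f_1|} \log \frac{f_1}{|f_1|}\, \mathrm{dvol}_X.$$
Indeed, the entropy takes all real values even for normal distributions, whereas the relative entropy is non-negative for any pair of distributions, where the non-negativity is obvious from $\log(1/t) \ge 1 - t$ and is called the Gibbs inequality. Further, if we multiply the volume form by a positive constant, the entropy changes while the relative entropy does not. In any case, putting $f = f_1/f_0$ and using the volume form $f_0\, \mathrm{dvol}_X$, we have
$$D(f_1 \| f_0) = \int_X \frac{f}{|f|} \log \frac{f}{|f|}\; f_0\, \mathrm{dvol}_X + \log |f_0|,$$
where $|f| = \int_X f\, f_0\, \mathrm{dvol}_X = |f_1|$. That is, up to the constant $\log |f_0|$, the relative entropy is the negative of the entropy of f with respect to $f_0\, \mathrm{dvol}_X$. Note that we cannot put $f_0 = 1$ unless $\mathrm{dvol}_X$ is finite. If we multiply the volume form by a non-constant density, the relative entropy varies in general. We notice that the choice of the volume forms in the above mutual learning does not affect the result of the learning.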
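The behaviour under a constant rescaling of the volume form can be checked numerically on a finite set. The sketch below assumes the standard definitions of entropy and relative entropy for densities normalized by their total mass; the sample densities and the scaling constant are hypothetical.

```python
import math


def mass(f, vol):
    """Total mass of the density f with respect to the volume form vol."""
    return sum(fi * vi for fi, vi in zip(f, vol))


def entropy(f, vol):
    """Entropy of the probability (f/|f|) dvol."""
    m = mass(f, vol)
    return -sum((fi / m) * math.log(fi / m) * vi for fi, vi in zip(f, vol))


def relative_entropy(f1, f0, vol):
    """Relative entropy of (f1/|f1|) dvol with respect to (f0/|f0|) dvol."""
    m1, m0 = mass(f1, vol), mass(f0, vol)
    return sum(
        (a / m1) * math.log((a / m1) / (b / m0)) * v
        for a, b, v in zip(f1, f0, vol)
    )


vol = [1.0, 1.0, 1.0]
f0 = [1.0, 1.0, 2.0]
f1 = [3.0, 1.0, 1.0]
scaled = [0.01 * v for v in vol]  # multiply the volume form by the constant 0.01

h, h_scaled = entropy(f1, vol), entropy(f1, scaled)
d, d_scaled = relative_entropy(f1, f0, vol), relative_entropy(f1, f0, scaled)
print(h, h_scaled, d, d_scaled)
```

In this example the entropy shifts by log 0.01 and even changes sign, while the relative entropy stays the same and remains non-negative.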

Mutual Learning via Relative Entropy
The information geometry [4], as well as its partial globalization by the author [1,5], starts with a family of probability distributions. Slightly more generally, we consider a manifold W equipped with a volume form $\mathrm{dvol}_W$ and a family $\{h_x\}_{x \in X}$ of densities with finite total masses on it. We regard the parameter space X as a manifold, and define the function $\phi : X \times X \to \mathbb{R}_{\ge 0}$ on its square by
$$\phi(x, y) = \int_W \frac{h_x}{|h_x|} \log\left(\frac{h_x/|h_x|}{h_y/|h_y|}\right) \mathrm{dvol}_W.$$
The information geometry focuses on the 3-jet of ϕ at the diagonal set ∆ ⊂ X × X. From the Gibbs inequality, the symmetric quadratic tensor defined by the 2-jet of ϕ is positive semi-definite. If it is positive definite, it defines a Riemannian metric called the Fisher-Rao metric. Then the symmetric cubic tensor defined by the 3-jet of the anti-symmetrization ϕ(x, y) − ϕ(y, x) directs a line of torsion-free affine connections passing through the Levi-Civita connection of the Fisher-Rao metric. This line of connections is the main subject of the information geometry. On the other hand, developing the global geometry in [1,5], we define Luhmann's potential for mutual learning between two copies of X as
$$\lambda(x, y) = \exp(-\phi(x, y)).$$
We couple a prior density $f_0 : X \to \mathbb{R}_{>0}$ on the first factor with a prior density $g_0 : X \to \mathbb{R}_{>0}$ on the second factor through the potential λ on the product X × X. Here, the mutual learning updates $f_0$ and $g_0$, respectively, to the posterior densities
$$f_1(x) = \lambda(x, y_0)\, f_0(x), \qquad g_1(y) = \lambda(x_0, y)\, g_0(y).$$
Note that the function ϕ changes if we multiply the volume form $\mathrm{dvol}_W$ by a non-constant density in general. Thus, the choice of the volume form is crucial. The volume form $\mathrm{dvol}_X$ might be related to the Fisher-Rao metric, although the choice of $\mathrm{dvol}_X$ is indeed irrelevant to the mutual learning. We can also imagine that the other volume forms $\mathrm{dvol}_{X \times X}$ and $\mathrm{dvol}_W$ have been determined in earlier mutual learnings "connected" to the current one.
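To make the construction concrete, the following sketch builds ϕ and Luhmann's potential from a small family of densities on a finite set W. It assumes that ϕ(x, y) is the relative entropy of the normalized h_x with respect to the normalized h_y, the direction suggested by the categorical formulas later in the paper. The three-member family and its labels are hypothetical.

```python
import math


def phi(hx, hy, vol_W):
    """Relative entropy of h_x/|h_x| with respect to h_y/|h_y| on a finite W."""
    mx = sum(a * v for a, v in zip(hx, vol_W))
    my = sum(b * v for b, v in zip(hy, vol_W))
    return sum(
        (a / mx) * math.log((a / mx) / (b / my)) * v
        for a, b, v in zip(hx, hy, vol_W)
    )


def potential(family, vol_W):
    """Luhmann's potential lam(x, y) = exp(-phi(x, y)) on the square of the parameter set."""
    return {
        (x, y): math.exp(-phi(family[x], family[y], vol_W))
        for x in family
        for y in family
    }


vol_W = [1.0, 1.0, 1.0]
family = {  # hypothetical parameter set X = {"a", "b", "c"}
    "a": [0.7, 0.2, 0.1],
    "b": [0.3, 0.4, 0.3],
    "c": [1.0, 1.0, 1.0],  # total mass 3: densities need not be normalized
}
lam = potential(family, vol_W)

# One mutual learning step from the datum (x0, y0) with flat priors.
x0, y0 = "a", "b"
f1 = {x: lam[(x, y0)] for x in family}
g1 = {y: lam[(x0, y)] for y in family}
```

The potential equals 1 on the diagonal and is at most 1 elsewhere, so each posterior peaks at the point where the two arguments coincide. It is not symmetric in general, which is what the anti-symmetrization ϕ(x, y) − ϕ(y, x) measures.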

Results
We address the following problem in certain cases below.

Problem 1.
Does the mutual learning via the relative entropy substitute for the conventional Bayesian estimation of the parameter of the family $\{h_x\}$?

Remark 1.
The mutual learning uses only the relative entropy, whereas the conventional Bayesian estimation needs all the information about the family. Thus, Problem 1 also asks if the mutual learning can "sufficiently restore" the family from the relative entropy. To clarify this point, we use the constant 1 as the formal prior density in the sequel even when the total volume is infinite. Then one may compare the family with the particular posterior $g_1$ to see "how much" it is restored.

Categorical Distributions
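Let W be a 0-dimensional manifold with N + 1 unit components, i.e., W = {0, . . . , N} with volume form $\mathrm{dvol}_W = 1$. A point $x = (x_0, \dots, x_N)$ of the open N-simplex
$$X = \{(x_0, \dots, x_N) \in \mathbb{R}^{N+1}_{>0} \mid x_0 + \cdots + x_N = 1\}$$
with the standard volume form $\mathrm{dvol}_X$ presents a categorical distribution (i.e., a finite distribution) on W. We take the product manifold X × X with Luhmann's potential
$$\lambda(x, y) = \exp\left(-\sum_{i=0}^{N} x_i \log \frac{x_i}{y_i}\right).$$
Suppose that the prior densities are the constants $f_0(x) \equiv 1$ and $g_0(y) \equiv 1$ on the first and second factors of X × X. Then, the iteration of mutual Bayesian learning from the data $(x^k, y^k)$, $k = 0, \dots, n-1$, yields
$$f_n(x) = \exp\left(n x_0 \overline{\log y_0} + \cdots + n x_N \overline{\log y_N} - n x_0 \log x_0 - \cdots - n x_N \log x_N\right),$$
$$g_n(y) \propto \exp\left(n \overline{x_0} \log y_0 + \cdots + n \overline{x_N} \log y_N\right),$$
where the overlines denote arithmetic means over the data, $\overline{x_0} = \dfrac{x_0^0 + \cdots + x_0^{n-1}}{n}$, etc.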

Proposition 1.
We have the following maximum a posteriori (MAP) estimations:
$$x_0 : \cdots : x_N = \exp(\overline{\log y_0}) : \cdots : \exp(\overline{\log y_N}) \;\Rightarrow\; f_n(x) = \max f_n,$$
$$y = \overline{x} = (\overline{x_0}, \cdots, \overline{x_N}) \;\Rightarrow\; g_n(y) = \max g_n.$$
We notice that the probability $(g_n/|g_n|)\,\mathrm{dvol}_X$ for the posterior density $g_n$ on the second factor of X × X is known as the Dirichlet distribution.
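The Dirichlet distribution $\mathrm{Dir}(\alpha_0, \cdots, \alpha_N)$ with parameters $(\alpha_0, \cdots, \alpha_N) \in \mathbb{R}^{N+1}_{>0}$ is presented by the probability $(f/|f|)\,\mathrm{dvol}_X$ on the open N-simplex $X \subset \mathbb{R}^{N+1}$ for the density $f(x) = x_0^{\alpha_0 - 1} \cdots x_N^{\alpha_N - 1}$. In particular, Dir(1, · · · , 1), for which the density is constant, is called the flat Dirichlet distribution.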
Here we believe that the data $x^k \in X$ obey the probability $(\lambda(x, y^k)/|\lambda(x, y^k)|)\,\mathrm{dvol}_X$, which we can consider as a continuous version of the categorical distribution. Imagine that a coarse graining of the data $x^k$ on X yields data $\tilde{x}^k$ obeying a categorical distribution on the 0-skeleton W of the closure of X. Then the probability $(g_n/|g_n|)\,\mathrm{dvol}_X$ for the new data $\tilde{x}^k$ reaches the posterior probability of the conventional Bayesian learning.
The following is the summary of the above.

Theorem 1.
Instead of the conventional Bayesian learning from categorical data, we consider the mutual learning on the product of two copies of the space of categorical distributions via the relative entropy. Then a coarse graining of the data of the first factor into the 0-skeleton of the closure of the domain deforms the second factor of the mutual learning into the conventional Bayesian learning.
Thus, the answer to Problem 1 is affirmative in this case.
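The categorical case can be explored numerically with a short sketch. It assumes the potential exp(−Σ x_i log(x_i/y_i)) and flat priors, so that the logarithms of f_n and g_n are the sums coded below; the function names and the sample data are hypothetical. The last function uses the observation that g_n is proportional to the product of the powers y_i^(n·mean of x_i), i.e., to the density of Dir(Σ_k x_i^k + 1).

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def log_f(x, ys):
    """log f_n(x) = n * sum_i x_i * (mean_k log y_i^k - log x_i)."""
    n = len(ys)
    return n * sum(
        xi * (mean(math.log(y[i]) for y in ys) - math.log(xi))
        for i, xi in enumerate(x)
    )


def log_g(y, xs):
    """log g_n(y) up to an additive constant: n * sum_i mean_k(x_i^k) * log y_i."""
    n = len(xs)
    return n * sum(mean(x[i] for x in xs) * math.log(yi) for i, yi in enumerate(y))


def map_first_factor(ys):
    """x_i proportional to exp(mean_k log y_i^k), i.e., normalized geometric means."""
    w = [math.exp(mean(math.log(y[i]) for y in ys)) for i in range(len(ys[0]))]
    total = sum(w)
    return [wi / total for wi in w]


def map_second_factor(xs):
    """y equal to the arithmetic mean of the data x^k."""
    return [mean(x[i] for x in xs) for i in range(len(xs[0]))]


def dirichlet_parameters(xs):
    """Parameters of the Dirichlet distribution presented by g_n."""
    return [sum(x[i] for x in xs) + 1 for i in range(len(xs[0]))]


xs = [(0.2, 0.3, 0.5), (0.6, 0.1, 0.3)]      # hypothetical data on the first factor
ys = [(0.5, 0.25, 0.25), (0.1, 0.6, 0.3)]    # hypothetical data on the second factor
x_map = map_first_factor(ys)
y_map = map_second_factor(xs)

# Coarse-grained data: each datum is a vertex of the closed simplex.
vertices = [(1, 0, 0), (0, 0, 1), (1, 0, 0)]
counts_plus_one = dirichlet_parameters(vertices)
```

For the coarse-grained data the parameters become the category counts plus one, which are the parameters of the conventional Dirichlet posterior for categorical data under a flat prior.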

Normal Distributions
In the case where X is the space of normal distributions, we would like to change the coordinates of the second factor of the product X × X to make the expression simpler, although one can reach the same result through a straightforward calculation.

The Coordinate System
Let X be the upper-half plane {(m, s) | m ∈ R, s ∈ R >0 } and W the line {w | w ∈ R}. Suppose that any point (m, s) of X presents the normal distribution N(m, s 2 ) on W with mean m and standard deviation s. The relative entropy is expressed as This implies that the Fisher-Rao metric is the half of the Poincaré metric. We put and consider the symplectic product (X, dvol X ) × (X, dvol X ) = (X × X , dvol X − dvol X ).
In [5], the author fixed the Lagrangian correspondence which is the graph of the symplectic involution The function D enjoys symplectic/contact geometric symmetry as well as the submanifold N. See [1] for the multivariate versions of D and N with Poisson geometric symmetry.

The Mutual Learning
In the above setting, we define Luhmann's potential by Put $f_0(m, s) \equiv 1$ and $g_0(M, S) \equiv 1$. Then, the iteration of the mutual learning yields Its density form is the volume form with unit total mass, which is proportional to Using our volume form $\mathrm{dvol}_X$, we can write the density form of NIG(µ, ν, α, β) as This is proportional to $g_n\,\mathrm{dvol}_X$ on the second factor of X × X when $(\mu, \nu, \alpha, \beta) = \left(\overline{m},\; n,\; \dfrac{n+1}{2},\; \dfrac{n(\overline{m^2} - \overline{m}^2 + \overline{s^2})}{2}\right)$.
We identify the line W with the boundary of X. The conventional Bayesian learning of the normal data $m_0, \dots, m_{n-1}$ yields the posterior $\mathrm{NIG}\left(\overline{m},\; n,\; \dfrac{n+1}{2},\; \dfrac{n(\overline{m^2} - \overline{m}^2)}{2}\right)$ provided that the prior is formally 1. Thus, we have the following result similar to Theorem 1.

Theorem 2.
Instead of the conventional Bayesian learning from normal data on R, we consider the mutual learning on the product of two copies of the space X of normal distributions via the relative entropy. Then a coarse graining of the data of the first factor into the boundary ∂X = R by taking s → 0 deforms the second factor of the mutual learning into the conventional Bayesian learning.
Thus, the answer to Problem 1 is also affirmative in this case.
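The role of the limit s → 0 can be illustrated by a rough numerical sketch. It rests on an assumption that differs from the text: both factors are described here by the plain coordinates (mean, standard deviation), not by the changed coordinates of the second factor. The potential is taken to be exp(−D), with D the standard relative entropy between two normal distributions. The parameter α is omitted because it depends on the choice of volume form. All names and data are hypothetical.

```python
import math


def kl_normal(m, s, M, S):
    """Relative entropy of N(m, s^2) with respect to N(M, S^2)."""
    return math.log(S / s) + (s * s + (m - M) ** 2) / (2 * S * S) - 0.5


def log_g(M, S, data):
    """log g_n(M, S) for the flat prior: minus the sum of relative entropies."""
    return -sum(kl_normal(m, s, M, S) for m, s in data)


def nig_mu_nu_beta(data):
    """(mu, nu, beta) read off from the data (m_k, s_k) on the first factor."""
    n = len(data)
    m_bar = sum(m for m, _ in data) / n
    m2_bar = sum(m * m for m, _ in data) / n
    s2_bar = sum(s * s for _, s in data) / n
    return m_bar, n, n * (m2_bar - m_bar ** 2 + s2_bar) / 2


def log_g_closed(M, S, data):
    """The same log g_n in normal-inverse-gamma form, constants included."""
    mu, nu, beta = nig_mu_nu_beta(data)
    const = sum(math.log(s) for _, s in data) + nu / 2
    return const - nu * math.log(S) - (2 * beta + nu * (M - mu) ** 2) / (2 * S * S)


data = [(0.0, 1.0), (2.0, 1.0), (1.0, 0.5)]       # hypothetical data (m_k, s_k)
coarse = [(m, 1e-6) for m, _ in data]             # coarse graining: s_k -> 0

direct = log_g(0.7, 1.3, data)
closed = log_g_closed(0.7, 1.3, data)
beta_coarse = nig_mu_nu_beta(coarse)[2]
```

As the standard deviations of the data shrink, β approaches n times the sample variance of the means divided by 2, which is the value appearing in the conventional posterior.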

Von Mises Distributions with Fixed Concentration in Circular Case
A von Mises distribution $M_k(m)$ with a fixed large concentration $k$ ($\gg 1$) is a circular analogue of a normal distribution with a fixed small variance; it is parametrized by a point m of X = R/2πZ. Its density is proportional to the restriction of the function exp(k cos(m) x + k sin(m) y) to the circle W = {(x, y) | x = cos w, y = sin w, w ∈ R/2πZ} with $\mathrm{dvol}_W = dw$. Then, using the easy formula $\int_0^{2\pi} \exp(k \cos x) \sin x\, dx = 0$, we obtain the following expression of the relative entropy:
$$\phi(m, m') = c\,(1 - \cos(m - m')),$$
where c is a positive constant. (When k ∈ Z, using modified Bessel functions, we have $c = k I_1(k)/I_0(k)$.) We put $f_0(m) \equiv 1$ and $g_0(m') \equiv 1$. Then, the iteration of mutual Bayesian learning on the torus X × X yields
$$f_n(m) \propto \exp\left(nc\,\overline{\cos(m')} \cos m + nc\,\overline{\sin(m')} \sin m\right), \qquad g_n(m') \propto \exp\left(nc\,\overline{\cos(m)} \cos m' + nc\,\overline{\sin(m)} \sin m'\right).$$
On the other hand, the conventional Bayesian learning on W yields the posterior probability density proportional to $\exp\left(nk\,\overline{\cos(m)} \cos w + nk\,\overline{\sin(m)} \sin w\right)$, which looks like $g_n(w)$. This suggests the affirmative answer to Problem 1.
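The relative entropy between two von Mises distributions with a common concentration can be checked by numerical integration. The sketch assumes the closed form c(1 − cos(m − m')) with c = k I_1(k)/I_0(k) and computes the Bessel functions from their integral representation for integer order. The step count and the sample angles are arbitrary choices.

```python
import math


def bessel_i(order, k, steps=2000):
    """I_order(k) = (1/pi) * integral over [0, pi] of exp(k cos t) cos(order t), integer order."""
    h = math.pi / steps
    return sum(
        math.exp(k * math.cos((j + 0.5) * h)) * math.cos(order * (j + 0.5) * h)
        for j in range(steps)
    ) * h / math.pi


def kl_von_mises(m, m_prime, k, steps=2000):
    """Relative entropy of M_k(m) with respect to M_k(m_prime) by the midpoint rule."""
    h = 2 * math.pi / steps
    norm = 2 * math.pi * bessel_i(0, k)
    total = 0.0
    for j in range(steps):
        w = (j + 0.5) * h
        p = math.exp(k * math.cos(w - m)) / norm
        # log of the density ratio: k cos(w - m) - k cos(w - m_prime)
        total += p * k * (math.cos(w - m) - math.cos(w - m_prime)) * h
    return total


def mode_of_g(ms):
    """Maximizer of exp(c * sum_k cos(m_k - m')): direction of the mean resultant vector."""
    return math.atan2(sum(math.sin(m) for m in ms), sum(math.cos(m) for m in ms))


k = 5.0
c = k * bessel_i(1, k) / bessel_i(0, k)
numeric = kl_von_mises(0.3, 1.1, k)
closed = c * (1 - math.cos(0.3 - 1.1))
direction = mode_of_g([0.1, 0.3])
```

Since c < k, the posterior g_n is slightly less concentrated than the conventional posterior built from the same sample means, while its mode points in the same direction.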

Conclusions
We have observed that the answer to Problem 1 is affirmative in some cases. Specifically, the mutual Bayesian learning covers at least a non-empty area of parametric statistics. The author expects that it could cover the whole from some consistent perspective.

On Socio-Cybernetics
In our setup of mutual learning, a system must be organized as the product of two manifolds with Luhmann's potential before each member learns. Further, the potential is the result of an earlier mutual learning in which the system was a member. In Luhmann's description [3], the unit of society is not the agent of an action but a communication or rather a chain of communications. In mathematics, a manifold is locally a product of manifolds and is characterized as the algebraic system of functions on it. By analogy, Luhmann's society seems to be a system of relations between certain systems of functions. Some authors criticize his theory for failing to acquire individual identity, but an individual is a relation between identities that are already represented by manifolds.
As a matter of course, reality cannot be explained by theories. Instead, a theory which can better explain something about reality is chosen. In Section 2.2, we have assumed that $(x, y)_0$ and $(x_0, y_0)$ are given as the same point in reality. Then there are two possibilities: (1) The potential λ is updated by using $(x, y)_0$ as a component of a datum, or (2) the mutual learning of $(x_0, y_0)$ is performed under the undated potential λ. The discordance between (1) and (2) does not affect the reality. Further, there is no consistent hierarchy among Luhmann's systems that choose either (1) or (2), and therefore there is no system that is a proper subsystem of another system. Perhaps, the social system chooses either (1) or (2), whichever can better explain the "fact" in relation to other "facts" in a story about reality. Taking all of the above into account, the notion of autopoiesis that Maturana and Varela [8] found in living organisms can be the foundation of Luhmann's socio-cybernetics.

On the Total Entropy
In objective probability theory, one considers a continuous probability distribution as the limit of a family of finite distributions presented by relative frequency histograms and the entropy of the limit as the limit of the entropies. Since the entropy of a finite distribution whose support is not a singleton is positive, a distribution with negative entropy, e.g., a normal distribution with small variance, does not appear. On the other hand, we take the position of subjective probability theory, and regard a positive function on a manifold that has the unit mass with respect to a fixed volume form as a probability density. From our point of view, the relative entropy between two probability densities is essential as it is non-negative; it presents the information gain; and it does not change (while even the sign of entropy does change) by multiplying the volume form by any positive constant. We notice that an objective probability is a subjective probability, and not vice-versa.
We know that the lowest entropy at the beginning of the universe must be relative to higher entropy in the future. In this regard, the total amount of information decreases with time. However, it is still possible that the amount of consumable information increases, and perhaps that is how this world works. Here we would like to distinguish the world from the universe, even though they concern the same reality and therefore communicate with each other. The world consists of human affairs, including the possible variations of knowledge on facts in the universe: there is no love in the universe, but love is the most important consumable thing in the world. We consider that the notion of complexity in Luhmann's systems theory concerns such consumability as it relates to coupling of systems. Now the problem is not the total reserve of information, but how to strike it and refine it like oil. At present, autopoiesis is gaining ground against mechanistic cybernetics. Our research goes against this stream: its goal is to invent a learning machine to exploit information resources to be consumed by humans and machines.

On Geometry
In this paper, we have quickly gone from the general definition of mutual learning to a discussion of the special mutual learning via relative entropy. However, it may be worthwhile to stop and study various types of learning according to purely geometric interests. For example, the result of previous work [1] is apparently related to the geometry of dual numbers, and fortunately this special issue includes a study [9] on a certain pair of dual number manifolds. Considering mutual learning for pairs of related manifolds such as this is something to be investigated in the future.
In addition, in proceeding to the case of the mutual learning via relative entropy, one basic problem was left unaddressed: Given a non-negative function ϕ on a squared manifold M × M that takes zero on the diagonal set, can we take a family of probability densities with parameter space M so that the relative entropy induces the function ϕ?
Funding: This research received no external funding.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The author declares no conflict of interest.