Abstract
In previous work, we described the geometry of Bayesian learning on a manifold. In this paper, inspired by sociologist Niklas Luhmann's modified notion of the double contingency of communications, we take two manifolds on an equal footing and a potential function on their product to set up mutual Bayesian learning. In particular, given a parametric statistical model, we consider mutual learning between two copies of the parameter space. Here, we associate the potential with the relative entropy (i.e., the Kullback–Leibler divergence). Although the mutual learning forgets everything about the model except the relative entropy, it still substitutes for the usual Bayesian estimation of the parameter in a certain case. We propose it as a globalization of the information geometry.
1. Introduction
This is a sequel to the author's research [1] on the geometry of Bayesian learning. We introduce mutual Bayesian learning by taking two manifolds, each of which is the parameter space of a family of density functions on the other. This setting has the following background in sociology, which may seem more ideological than practical.
Talcott Parsons [2] introduced the notion of double contingency in sociology. Here, the contingency is that no event is necessary and no event is impossible. A possible understanding of this definition appeals to probability theory. Specifically, even an event with probability 1 does not always occur, and even an event with probability 0 sometimes occurs, as a non-empty null set appears, at least conceptually. We consider the contingency as the subjective probabilistic nature of society. In fact, updating the conceptual subjective probability according to Bayes' rule should be a response to the conventional contingency that the prior probability is not a suitable predictor in reality. However, the double contingency is not straightforward, as it concerns mutually dependent social actions. In this article, we describe the double contingency by means of Bayesian learning. In our description, when one learns from another, the opposite learning also proceeds. This implies that, in contrast to sequential games such as chess, the actions in a double contingency have to be selected at once. Niklas Luhmann [3] leveraged this simultaneity to regard people not as individuals but as a single agent that he called a system. This further enabled him to apply the double contingency to any communications between systems. We introduce a function on the product of two manifolds to understand his systems theory.
From a practical perspective, we consider a family of probability densities on a manifold $W$ and regard the parameter space $X$ as a manifold. The product $X \times X$ carries the function induced from the relative entropy. Recall that the information geometry [4] is a differential geometry on the diagonal set $\Delta_X \subset X \times X$, which deals with the 3-jet of this function at $\Delta_X$. The author [5] began exploring the global geometry of this function. Now we take the above function as the source of Luhmann's potential, and show that the mutual Bayesian learning between two copies of $X$ substitutes for the original Bayesian estimation on $W$ in a certain case. We notice that the global geometry, as well as the information geometry, forgets the original problem on $W$ and addresses a related problem on $X$. In this regard, our mutual Bayesian learning is a globalization of the information geometry.
2. Mathematical Formulation
2.1. Geometric Bayesian Learning
We work in the $C^\infty$-smooth category. Take a possibly non-compact and possibly disconnected manifold $X$ equipped with a volume form $\omega_X$. Note that a discrete set is a 0-dimensional manifold on which a positive function is a volume form. Suppose that each point $x$ of the manifold $X$ presents a possible action of a person. A positive function on $X$ is called a density. If its integral is finite, it defines a probability on $X$ after normalization. Suppose that the selection of an action $x$ is weighted by a density $\mu$ on $X$. In our story, the person believes that a density $q(\,\cdot \mid x)$ on another manifold $Y$ depends on his action $x$. That is why the person perceives a given point $y_0 \in Y$ by multiplying the density $\mu$ by the function
\[
x \;\longmapsto\; q(y_0 \mid x),
\]
which is called the likelihood of the datum $y_0$. The perception updates the prior density $\mu(x)$ to the posterior density $\mu(x)\,q(y_0 \mid x)$. Indeed, after normalization, $\mu(x)\,q(y_0 \mid x)$ gives the Bayesian posterior probability provided that the normalization of $\mu$ gives the prior probability. The only change from the description in [1] is the aim of the learning, i.e., prediction is replaced with action. Although the word action has an active meaning, an activity consisting of countless actions would be a chain of automatic adaptations to the environment.
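For illustration, a minimal numerical sketch of this update on a 0-dimensional $X$ (the finite spaces, the prior, and the believed densities below are hypothetical choices, not taken from the text):

import numpy as np

# Hypothetical finite action space X = {0, 1, 2} and data manifold Y = {0, 1}.
# q[x, y] plays the role of the believed density q(y | x) on Y for each action x.
q = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])

mu = np.array([1.0, 1.0, 1.0])   # prior density mu on X (not necessarily normalized)
y0 = 1                           # perceived datum y0 in Y

posterior = mu * q[:, y0]        # update mu(x) -> mu(x) q(y0 | x)
print(posterior / posterior.sum())   # normalizing gives the Bayesian posterior probability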
2.2. Mutual Learning
It is natural to symmetrize the above setting by exchanging the roles of $X$ and $Y$. Specifically, we further suppose that a point $y$ of the second manifold $Y$ parameterizes a density $p(\,\cdot \mid y)$ on the first manifold $X$, and the perception of a datum $x_0 \in X$ by the second person updates a prior density $\nu$ on the second manifold to the posterior density $\nu(y)\,p(x_0 \mid y)$. This models the double contingency of Parsons [2]. We further modify it as follows. Fix volume forms $\omega_X$, $\omega_Y$, and $\omega_{X \times Y}$ on $X$, $Y$, and $X \times Y$, respectively. Take densities $\mu$ on $X$, $\nu$ on $Y$, and $V$ on $X \times Y$. Suppose that the prior densities $\mu$ and $\nu$, respectively, change to posterior densities by using the single density $V$ in place of the two families above, as made explicit below.
This models the double contingency of Luhmann [3]. We say that $\mu$ is coupled with $\nu$ in the mutual learning through Luhmann's potential $V$ on the product $X \times Y$. Since the potential $V$ is also a density, it can be coupled with a density on another manifold $Z$. Specifically, if there is a datum and a density for this further coupling, the pair of the two persons can change the tendency of its action selection. This mathematics enables us to consider the double contingency not only between persons but also between systems. Here we suppose that the datum perceived in the mutual learning and the corresponding component of the datum of this higher-level learning are given as the same point. We emphasize that what we are discussing is not how the datum appears objectively, but how we perceive it or how we learn from it subjectively. We discuss in Section 4 the discordance between the two levels of learning to understand a proposition in Luhmann's systems theory saying that no system is a subsystem.
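In the notation above, a plausible explicit form of this modified, simultaneous update is the following (a reconstruction consistent with the computations of Section 3 rather than a quotation): given a perceived datum $(x_0, y_0) \in X \times Y$,
\[
\mu(x) \;\longmapsto\; \mu(x)\,V(x, y_0),
\qquad
\nu(y) \;\longmapsto\; \nu(y)\,V(x_0, y).
\]
Both updates thus use the single density $V$ in place of the two separate families $q(\,\cdot \mid x)$ and $p(\,\cdot \mid y)$.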
2.3. Relative Entropy
Shannon [6] introduced the notion of entropy in information theory. As for continuous distributions, Jaynes [7] pointed out that the notion of relative entropy
\[
D(p \,\|\, q) \;=\; \int p\,\log\frac{p}{q}\;\omega
\]
of probability densities $p$ and $q$ with respect to a volume form $\omega$ is rather foundational to the notion of entropy
\[
H(p) \;=\; -\int p\,\log p\;\omega .
\]
Indeed, the entropy takes all real values even for normal distributions, whereas the relative entropy is non-negative for any pair of distributions, where the non-negativity is obvious from the elementary inequality $\log t \le t - 1$ and is called the Gibbs inequality. Further, if we multiply the volume form by a positive constant, the entropy changes while the relative entropy does not. In any case, putting $q = 1$ and using the volume form $\omega$, we have
\[
H(p) \;=\; -D(p \,\|\, 1).
\]
Note that we cannot put $q = 1$ unless the total volume $\int \omega$ is finite. If we multiply the volume form by a non-constant density, the relative entropy varies in general. We notice that the choice of the volume forms in the above mutual learning does not affect the result of the learning.
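As a short check of the invariance statements above, in the same notation: if the volume form $\omega$ is replaced by $c\,\omega$ for a constant $c > 0$, a probability density $p$ with respect to $\omega$ becomes $p/c$ with respect to $c\,\omega$, and
\[
-\int \frac{p}{c}\,\log\frac{p}{c}\;c\,\omega \;=\; -\int p\,\log p\;\omega \;+\; \log c,
\qquad
\int \frac{p}{c}\,\log\frac{p/c}{q/c}\;c\,\omega \;=\; \int p\,\log\frac{p}{q}\;\omega,
\]
so the entropy shifts by $\log c$ while the relative entropy is unchanged.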
2.4. Mutual Learning via Relative Entropy
The information geometry [4], as well as its partial globalization by the author [1,5], starts with a family of probability distributions. Slightly more generally, we consider a manifold $W$ equipped with a volume form $\omega_W$ and a family $\{p(\,\cdot \mid x)\}_{x \in X}$ of densities with finite total masses on it. We regard the parameter space $X$ as a manifold, and define the function $D$ on its square $X \times X$ by
\[
D(x, x') \;=\; \int_W p(w \mid x)\,\log\frac{p(w \mid x)}{p(w \mid x')}\;\omega_W ,
\]
the relative entropy, where the densities are normalized to unit total mass before taking the integral.
The information geometry focuses on the 3-jet of $D$ at the diagonal set $\Delta_X \subset X \times X$. From the Gibbs inequality, the symmetric quadratic tensor defined by the 2-jet of $D$ is positive semi-definite. If it is positive definite, it defines a Riemannian metric called the Fisher–Rao metric. Then the symmetric cubic tensor defined by the 3-jet of the anti-symmetrization $D(x, x') - D(x', x)$ directs a line of torsion-free affine connections passing through the Levi-Civita connection of the Fisher–Rao metric. This line of connections is the main subject of the information geometry. On the other hand, developing the global geometry in [1,5], we define Luhmann's potential for mutual learning between two copies of $X$ as
\[
V(x, x') \;=\; e^{-D(x, x')}.
\]
We couple a prior density $\mu$ on the first factor with a prior density $\nu$ on the second factor through the potential $V$ on the product $X \times X$. Here, given a perceived datum $t \in X$ common to both factors, the mutual learning updates $\mu$ and $\nu$, respectively, to the posterior densities $\mu(x)\,V(x, t)$ and $\nu(x')\,V(t, x')$.
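Iterating this update over data $t^{(1)}, \dots, t^{(n)} \in X$ perceived as the same points on both factors yields, in our notation,
\[
\mu_n(x) \;=\; \mu(x)\prod_{k=1}^{n} e^{-D(x,\,t^{(k)})},
\qquad
\nu_n(x') \;=\; \nu(x')\prod_{k=1}^{n} e^{-D(t^{(k)},\,x')}.
\]
These products are the posteriors examined case by case in Section 3.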
Note that the function $D$ changes in general if we multiply the volume form $\omega_W$ by a non-constant density. Thus, the choice of the volume form $\omega_W$ is crucial. The volume form on $X$ might be related to the Fisher–Rao metric, although its choice is indeed irrelevant to the mutual learning. We can also imagine that the other volume forms, such as that on the product $X \times X$, have been determined in earlier mutual learnings “connected” to the current one.
3. Results
We address the following problem in certain cases below.
Problem 1.
Does the mutual learning via the relative entropy substitute for the conventional Bayesian estimation of the parameter $x \in X$ of the family?
Remark 1.
The mutual learning uses only the relative entropy, whereas the conventional Bayesian estimation needs all the information about the family. Thus, Problem 1 also asks if the mutual learning can “sufficiently restore” the family from the relative entropy. To clarify this point, we use the constant 1 as the formal prior density in the sequel even when the total volume is infinite. Then one may compare the family with the particular posterior to see “how much” it is restored.
3.1. Categorical Distributions
Let $W$ be a 0-dimensional manifold with $N + 1$ unit components, i.e., the finite set $W = \{0, 1, \dots, N\}$ with volume form $\omega_W \equiv 1$. A point $x$ of the open $N$-simplex
\[
X \;=\; \{\, x = (x_0, x_1, \dots, x_N) \;:\; x_i > 0,\ x_0 + x_1 + \cdots + x_N = 1 \,\}
\]
with the standard volume form presents a categorical distribution (i.e., a finite distribution) on $W$, namely the one assigning mass $x_i$ to the point $i$. We take the product manifold $X \times X$ with Luhmann's potential
\[
V(x, y) \;=\; e^{-D(x, y)} \;=\; \prod_{i=0}^{N}\Bigl(\frac{y_i}{x_i}\Bigr)^{x_i},
\qquad
D(x, y) \;=\; \sum_{i=0}^{N} x_i \log\frac{x_i}{y_i}.
\]
Suppose that the prior densities are the constants 1 and 1 on the first and second factors of $X \times X$. Then, the iteration of mutual Bayesian learning yields the posterior densities computed below, where the overlines denote arithmetic means, etc.
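Concretely, with data $x^{(1)}, \dots, x^{(n)} \in X$ shared by the two factors, one may compute the iterated posteriors as follows (a direct computation assuming the potential $e^{-D}$ above; we write $\bar{x}_i = \frac{1}{n}\sum_k x^{(k)}_i$ for the arithmetic means and $\check{x}_i = \bigl(\prod_k x^{(k)}_i\bigr)^{1/n}$ for the geometric means):
\[
\prod_{k=1}^{n} V\bigl(x, x^{(k)}\bigr) \;=\; \prod_{i=0}^{N}\Bigl(\frac{\check{x}_i}{x_i}\Bigr)^{n x_i}
\quad\text{on the first factor},
\qquad
\prod_{k=1}^{n} V\bigl(x^{(k)}, y\bigr) \;\propto\; \prod_{i=0}^{N} y_i^{\,n\bar{x}_i}
\quad\text{on the second factor}.
\]
The first expression is maximized on the simplex at the normalized geometric mean $x_i \propto \check{x}_i$, and the second at the arithmetic mean $y_i = \bar{x}_i$; the second is, up to normalization, the density of the Dirichlet distribution with parameters $n\bar{x}_i + 1$.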
Proposition 1.
We have the following maximum a posteriori (MAP) estimations: the posterior density on the first factor attains its maximum at the normalized geometric mean of the data, and the posterior density on the second factor attains its maximum at their arithmetic mean.
We notice that the probability given by the posterior density on the second factor of $X \times X$ is known as the Dirichlet distribution.
Definition 1.
The Dirichlet distribution $\mathrm{Dir}(\alpha_0, \dots, \alpha_N)$ for $\alpha_0, \dots, \alpha_N > 0$ is presented by the probability on the open $N$-simplex $X$ for the density
\[
x_0^{\alpha_0 - 1} x_1^{\alpha_1 - 1} \cdots x_N^{\alpha_N - 1}.
\]
In particular, the constant density 1, i.e., $\mathrm{Dir}(1, \dots, 1)$, is called the flat Dirichlet distribution.
We identify the set $W$ with the 0-skeleton of the closure of the open $N$-simplex $X$. If the prior is the flat Dirichlet distribution $\mathrm{Dir}(1, \dots, 1)$, the Bayesian learning from categorical data yields the posterior $\mathrm{Dir}(k_0 + 1, \dots, k_N + 1)$, where $k_i$ denotes the number of data in the $i$-th category. This is the conventional Bayesian learning from categorical data. On the other hand, the above probability is the Dirichlet distribution $\mathrm{Dir}(n\bar{x}_0 + 1, \dots, n\bar{x}_N + 1)$. Here we believe that the data obey a probability on $X$, which we can consider as a continuous version of the categorical distribution. Imagine that a coarse graining of the data on $X$ yields data obeying a categorical distribution on the 0-skeleton $W$ of the closure of $X$. Then the probability for the new data reaches the posterior probability of the conventional Bayesian learning.
The following is the summary of the above.
Theorem 1.
Instead of the conventional Bayesian learning from categorical data, we consider the mutual learning on the product of two copies of the space of categorical distributions via the relative entropy. Then a coarse graining of the data of the first factor into the 0-skeleton of the closure of the domain deforms the second factor of the mutual learning into the conventional Bayesian learning.
Thus, the answer to Problem 1 is affirmative in this case.
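The following numerical sketch illustrates Theorem 1, assuming the potential $e^{-D}$ as above; the particular data-generating distribution and the vertex-sampling coarse graining are hypothetical choices made only for the experiment:

import numpy as np

rng = np.random.default_rng(0)

N = 2                                                     # open 2-simplex: distributions on 3 categories
n = 500
data = rng.dirichlet(np.array([4.0, 2.0, 1.0]), size=n)  # data points on the simplex X

# Mutual learning, second factor: the posterior is proportional to
# prod_i y_i ** (n * mean_i), i.e., a Dirichlet density with parameters n * mean_i + 1.
mutual_exponents = n * data.mean(axis=0)
mutual_map = mutual_exponents / mutual_exponents.sum()    # mode of Dir(exponents + 1)

# Conventional Bayesian learning: coarse-grain each data point to a vertex of the
# closed simplex (here: category i sampled with probability x_i) and count.
categories = np.array([rng.choice(N + 1, p=t) for t in data])
counts = np.bincount(categories, minlength=N + 1)
conventional_map = counts / counts.sum()                  # mode of Dir(counts + 1)

print("mutual learning MAP   :", np.round(mutual_map, 3))
print("conventional Bayes MAP:", np.round(conventional_map, 3))

The two printed estimates agree closely for large n, in accordance with the theorem.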
3.2. Normal Distributions
In the case where $X$ is the space of normal distributions, we would like to change the coordinates of the second factor of the product $X \times X$ to make the expression simpler, although one can reach the same result through a straightforward calculation.
3.2.1. The Coordinate System
Let $X$ be the upper-half plane $\{(m, s) : m \in \mathbb{R},\ s > 0\}$ and $W$ the line $\mathbb{R}$. Suppose that any point $(m, s)$ of $X$ presents the normal distribution on $W$ with mean $m$ and standard deviation $s$. The relative entropy is expressed as
\[
D\bigl((m, s), (m', s')\bigr) \;=\; \log\frac{s'}{s} \;+\; \frac{s^{2} + (m - m')^{2}}{2\,s'^{2}} \;-\; \frac{1}{2}.
\]
This implies that the Fisher–Rao metric is the half of the Poincaré metric. We put
and consider the symplectic product . In [5], the author fixed the Lagrangian correspondence
which is the graph of the symplectic involution
Using it, the author took the “stereograph” of the relative entropy as follows. Regard a value $D$ of the relative entropy as a function of a pair of two points on the first factor of the product; take the point on the second factor which corresponds under $N$ to the second point of the pair; and regard the value $D$ as the value of a function of the resulting pair of points, one on each factor. That is, the relative entropy is transplanted to a function on the product through the correspondence $N$.
The function, as well as the submanifold $N$, enjoys symplectic/contact geometric symmetry. See [1] for the multivariate versions of this function and of $N$ with Poisson geometric symmetry.
3.2.2. The Mutual Learning
In the above setting, we define Luhmann’s potential by
Put and . Then, the iteration of the mutual learning yields
Since , we see that the density reaches the maximum at . Similarly, we can see that the density reaches the maximum when and hold.
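Independently of the coordinate change, the second-factor posterior can be computed directly in the original coordinates $(m, s)$ (a sketch assuming the potential $e^{-D}$ and data $(m_1, s_1), \dots, (m_n, s_n) \in X$ shared by the two factors; the notation is ours):
\[
\prod_{k=1}^{n} e^{-D\bigl((m_k, s_k),\,(m', s')\bigr)}
\;\propto\;
(s')^{-n}\exp\!\Bigl(-\,\frac{n\bigl(\overline{s^{2}}+\overline{(m-\bar{m})^{2}}+(\bar{m}-m')^{2}\bigr)}{2\,s'^{2}}\Bigr),
\]
where $\bar{m}$, $\overline{s^{2}}$, and $\overline{(m-\bar{m})^{2}}$ denote arithmetic means over the data; this is a normal-inverse-Gamma-type density in $(m', s'^{2})$.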
Definition 2.
The normal-inverse-Gamma distribution on the upper-half plane equipped with the volume form is the probability density proportional to
Its density form is the volume form with unit total mass, which is proportional to
Using our volume form , we can write the density form of as
This is proportional to on the second factor of when
We identify the line $W$ with the boundary $\{s = 0\}$ of the closure of $X$. The conventional Bayesian learning of the normal data yields the corresponding normal-inverse-Gamma posterior provided that the prior is formally 1. Thus, we have the following result similar to Theorem 1.
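For comparison, with the formal prior 1 and data $m_1, \dots, m_n \in W$, the conventional posterior density is proportional to (a standard computation in the same notation)
\[
\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,s'}\exp\!\Bigl(-\frac{(m_k-m')^{2}}{2\,s'^{2}}\Bigr)
\;\propto\;
(s')^{-n}\exp\!\Bigl(-\,\frac{n\bigl(\overline{(m-\bar{m})^{2}}+(\bar{m}-m')^{2}\bigr)}{2\,s'^{2}}\Bigr),
\]
which is exactly the $s_k \to 0$ limit of the second-factor posterior displayed above.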
Theorem 2.
Instead of the conventional Bayesian learning from normal data on $W$, we consider the mutual learning on the product of two copies of the space $X$ of normal distributions via the relative entropy. Then a coarse graining of the data of the first factor into the boundary $W$ by taking $s = 0$ deforms the second factor of the mutual learning into the conventional Bayesian learning.
Thus, the answer to Problem 1 is also affirmative in this case.
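A small numerical check of Theorem 2 (a sketch relying on the posteriors computed above; the data-generating parameters are hypothetical, and the maxima are taken with respect to the coordinates $(m', s')$):

import numpy as np

rng = np.random.default_rng(1)

n = 1000
m_data = rng.normal(2.0, 1.5, size=n)   # means of the observed normal distributions
s_data = np.full(n, 0.05)               # small standard deviations, i.e., data near the boundary s = 0

# Mutual learning, second factor: posterior proportional to
# (s')**(-n) * exp(-n * (mean(s^2) + mean((m - m')^2)) / (2 s'^2));
# its maximum is at m' = mean(m) and s'^2 = mean(s^2) + var(m).
A = np.mean(s_data**2) + np.var(m_data)
mutual_map = (np.mean(m_data), np.sqrt(A))

# Conventional Bayesian learning from the coarse-grained data m_1, ..., m_n with the
# formal prior 1: posterior proportional to (s')**(-n) * exp(-n * mean((m - m')^2) / (2 s'^2)).
conventional_map = (np.mean(m_data), np.sqrt(np.var(m_data)))

print("mutual learning MAP   :", np.round(mutual_map, 4))
print("conventional Bayes MAP:", np.round(conventional_map, 4))

As the standard deviations of the data tend to 0, the two estimates coincide, as Theorem 2 asserts.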
3.3. Von Mises Distributions with Fixed Concentration in Circular Case
A von Mises distribution with a fixed large concentration $\kappa$ is a circular analogue of a normal distribution with a fixed small variance; it is a distribution on the circle $W = \mathbb{R}/2\pi\mathbb{Z}$ parametrized by a point $m$ of $X = \mathbb{R}/2\pi\mathbb{Z}$. Its density is proportional to the restriction of the function $\exp(\kappa\cos(w - m))$ to the circle. Then, using the addition formula for the cosine, we obtain the following expression of the relative entropy:
\[
D(x, y) \;=\; c\,\bigl(1 - \cos(x - y)\bigr),
\]
where $c$ is a positive constant. (When the concentration is $\kappa$, using modified Bessel functions, we have $c = \kappa\, I_1(\kappa)/I_0(\kappa)$.) Thus, Luhmann's potential is $\exp(-c\,(1 - \cos(x - y)))$. We take data points $x_1, \dots, x_n$ of the circle $X$. Then, the iteration of mutual Bayesian learning on the torus $X \times X$ yields, on each factor, a posterior density proportional to $\exp\bigl(c\sum_{k=1}^{n}\cos(x - x_k)\bigr)$.
On the other hand, the conventional Bayesian learning on $W$ from data $w_1, \dots, w_n$ yields the posterior probability density proportional to $\exp\bigl(\kappa\sum_{k=1}^{n}\cos(w_k - m)\bigr)$, which looks like the posterior above. This suggests the affirmative answer to Problem 1.
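A short computation behind this comparison (a sketch under the assumptions above, with data $x_1, \dots, x_n$ on the circle):
\[
\prod_{k=1}^{n} e^{-c\,(1-\cos(x-x_k))}
\;\propto\;
\exp\Bigl(c\sum_{k=1}^{n}\cos(x-x_k)\Bigr)
\;=\;
\exp\bigl(c\,R\cos(x-\hat{\theta})\bigr),
\qquad
R\,e^{i\hat{\theta}} \;=\; \sum_{k=1}^{n} e^{i x_k},
\]
which is a von Mises-type density with concentration $cR$; the conventional posterior has the same form with $c$ replaced by the concentration $\kappa$.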
3.4. Conclusions
We have observed that the answer to Problem 1 is affirmative in some cases. Specifically, the mutual Bayesian learning covers at least a non-empty area of parametric statistics. The author expects that it could cover the whole from some consistent perspective.
4. Discussion
4.1. On Socio-Cybernetics
In our setup of mutual learning, a system must be organized as the product of two manifolds with Luhmann’s potential before each member learns. Further, the potential is the result of an earlier mutual learning in which the system was a member. In Luhmann’s description [3], the unit of society is not the agent of an action but a communication or rather a chain of communications. In mathematics, a manifold is locally a product of manifolds and is characterized as the algebraic system of functions on it. By analogy, Luhmann’s society seems to be a system of relations between certain systems of functions. Some authors criticize his theory for failing to acquire individual identity, but an individual is a relation between identities that are already represented by manifolds.
As a matter of course, reality cannot be explained by theories. Instead, a theory which can better explain something about reality is chosen. In Section 2.2, we have assumed that the two perceived points are given as the same point in reality. Then there are two possibilities: (1) the potential $V$ is updated by using this point as a component of a datum, or (2) the mutual learning of $\mu$ and $\nu$ is performed under the non-updated potential $V$. The discordance between (1) and (2) does not affect the reality. Further, there is no consistent hierarchy among Luhmann's systems that choose either (1) or (2), and therefore there is no system that is a proper subsystem of another system. Perhaps, the social system chooses either (1) or (2), whichever can better explain the “fact” in relation to other “facts” in a story about reality. Taking all of the above into account, the notion of autopoiesis that Maturana and Varela [8] found in living organisms can be the foundation of Luhmann's socio-cybernetics.
4.2. On the Total Entropy
In objective probability theory, one considers a continuous probability distribution as the limit of a family of finite distributions presented by relative frequency histograms, and the entropy of the limit as the limit of the entropies. Since the entropy of a finite distribution whose support is not a singleton is positive, a distribution with negative entropy, e.g., a normal distribution with small variance, does not appear. On the other hand, we take the position of subjective probability theory, and regard a positive function on a manifold that has unit mass with respect to a fixed volume form as a probability density. From our point of view, the relative entropy between two probability densities is essential: it is non-negative; it presents the information gain; and it does not change when we multiply the volume form by a positive constant (whereas even the sign of the entropy can change). We notice that an objective probability is a subjective probability, and not vice versa.
We know that the lowest entropy at the beginning of the universe must be relative to higher entropy in the future. In this regard, the total amount of information decreases in the course of time. However, it is still possible that the amount of consumable information increases, and perhaps that is how this world works. Here we would like to distinguish the world from the universe, even though they concern the same reality and therefore communicate with each other. The world consists of human affairs, including the possible variations of knowledge on facts in the universe—there is no love in the universe, but love is the most important consumable thing in the world. We consider that the notion of complexity in Luhmann's systems theory concerns such consumability as it relates to the coupling of systems. Now the problem is not the total reserve of information, but how to strike it and refine it like oil. At present, autopoiesis is gaining ground against mechanistic cybernetics. Our research goes against this stream: its goal is to invent a learning machine to exploit information resources to be consumed by humans and machines.
4.3. On Geometry
In this paper, we have quickly gone from the general definition of mutual learning to a discussion of the special mutual learning via relative entropy. However, it may be worthwhile to stop and study various types of learning according to purely geometric interests. For example, the result of previous work [1] is apparently related to the geometry of dual numbers, and fortunately this special issue includes a study [9] on a certain pair of dual number manifolds. Considering mutual learning for pairs of related manifolds such as this is something to be investigated in the future.
In addition, in proceeding to the case of the mutual learning via relative entropy, one basic problem was left unaddressed: given a non-negative function on a squared manifold $M \times M$ that vanishes on the diagonal set, can we take a family of probability densities with parameter space $M$ so that the relative entropy induces the given function?
Funding
This research received no external funding.
Data Availability Statement
Not applicable.
Conflicts of Interest
The author declares no conflict of interest.
References
- Mori, A. Global Geometry of Bayesian Statistics. Entropy 2020, 22, 240.
- Parsons, T. The Social System; Free Press: Glencoe, IL, USA, 1951.
- Luhmann, N. The autopoiesis of the social system. In Sociocybernetic Paradoxes: Observation, Control and Evolution of Self-Steering Systems; Geyer, R.F., van der Zouwen, J., Eds.; Sage: London, UK, 1986; pp. 172–192.
- Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
- Mori, A. Information geometry in a global setting. Hiroshima Math. J. 2018, 48, 291–305.
- Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Jaynes, E. Prior Probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241.
- Maturana, H.; Varela, F. Autopoiesis and Cognition: The Realization of the Living; Boston Studies in the Philosophy and History of Science 42; Reidel: Dordrecht, The Netherlands, 1972.
- Li, Y.; Alluhaibi, N.; Abdel-Baky, R.A. One-parameter Lorentzian dual spherical movements and invariants of the axodes. Symmetry 2022, 14, 1930.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).