Information Processing in the Brain as Optimal Entropy Transport: A Theoretical Approach

We consider brain activity from an information-theoretic perspective. We analyze information processing in the brain, considering the optimality of Shannon entropy transport using the Monge–Kantorovich framework. We propose that some of these processes satisfy an optimal transport condition for informational entropy. This optimality condition allows us to derive an equation of the Monge–Ampère type for the information flow, whose linearization accounts for the branching structure of neurons. Based on this fact, we discuss a version of Murray's law in this context.


Introduction
The brain, as the organ responsible for processing information in the body, has been subjected to evolutionary pressure. "The business of the brain is computation, and the task it faces is monumental. It must take sensory inputs from the external world, translate this information into a computationally accessible form, then use the results to make decisions about which course of action is appropriate given the most probable state of the world" [1]. It is therefore arguable that information processing has been optimized, at least to some extent, by natural selection [2,3]. This is a rather abstract claim that should ultimately be contrasted with experiments (in this respect, see [4,5]). In subsequent work, we will explore this question in more detail, but a plausible connection can be established with experimental and theoretical results via fMRI, in which different measures of cost have been proposed. The brain as an informational system is the subject of active research (see for instance [6][7][8]). For example, "behavior analysis has adopted these tools [Information theory] as a novel means of measuring the interrelations between behavior, stimuli, and contingent outcomes" [1]. In this context, in [1], it was shown that informational measures of contingency appear to be a reasonable basis for predicting behavior. We also refer the reader to the issue devoted to information theory in neuroscience [9] for a recent account of this perspective. In the work just mentioned, several papers investigate optimization principles (for instance, the maximum entropy principle [10] or the free energy principle; see [11,12]) as tools for understanding inference, coding, and other brain functionalities. Information theory can thus help fill the gaps between disciplines, for example, psychology, neurobiology, physics, mathematics, and computer science.
In the present paper, we adopt a related setting and mathematically formalize information processing in the brain within the framework of optimal transport. The rationale for this is consistent with the view that some essential brain functionalities such as inference and the coordination of tasks (e.g., auditory and motor activities) involve the transportation of information and that such processes should be efficient, have been subjected to evolutionary pressure, and as a consequence, are (pseudo) optimal. As was already pointed out, our theoretical proposal should be contrasted with experimental results.
It is necessary to observe from the very beginning that information processing and transport are of an intrinsically spatiotemporal nature. Therefore, our proposal should include these two features. In doing so, we expect the spatial part of the optimization to give rise to spatial patterns, for instance network-like or branching hierarchical components, as well as to temporal structure, such as periodic or synchronized patterns. However, in this paper, we begin by considering only the spatial part, except for a few general remarks. Since our proposal is an attempt to establish a methodological framework to study informational entropy transport in the brain, we deal with spatial aspects first, leaving a study of the dynamical aspects for a subsequent work. Furthermore, in order to simplify the problem, we consider only the one-dimensional case in the mathematical formalism. However, it is certain that the geometry of the brain has to play an essential role in all the processes [13], and we extrapolate some of the results we obtain to two and three dimensions. Now, we provide an overview of the paper. In Section 2, we present a general framework for information processing in the brain as an optimal transport of entropy, as well as some mathematical results in the context of the Monge-Kantorovich problem. The main idea is that what is being transported is informational entropy rather than some sort of physical mass. In order to provide the mathematical results, and for the sake of completeness, we adapt the material on the existence of a solution in the optimal mass transportation case as presented by Bonnotte [14] to the informational case. We conclude this section with a derivation of the Monge-Ampère equation in the one-dimensional case. We begin Section 3 by recalling the linearization of the Monge-Ampère equation around the square of the distance function, which involves the Laplacian.
Then, we argue that adding a nonlinear term is justified by the physiological nature of transmission along neurons. The resulting model is a semilinear elliptic equation. At the end of this section, we relate the qualitative features of the solutions with the branching structure of neural networks in the brain. In other words, we show that the optimal transport process of informational entropy is consistent with the geometric branching structure of neural branching. In Section 4, we elaborate on the relationship of the branching structure of neurons and Murray's law [15], which provides the optimal branching ratio of the father to the daughter branch sections, as well as the optimal bifurcation angle. We propose a modified version of Murray's law when the underlying transport network carries information instead of a fluid. The last section is devoted to concluding remarks, further research, and open questions.

The Monge-Kantorovich Problem
We present a general overview of the results on optimal transportation theory needed in this work. For a complete exposition, see [14,16,17] or [18]. For the sake of completeness, we include a general discussion without giving the proofs of the results, but with appropriate references. Our presentation closely follows [14] and some parts of [16].
The original Monge-Kantorovich problem was formulated in the context of mass transport (Monge) or budget allocation (Kantorovich). In a later section, we will adapt the setting to include the transport of informational entropy.
Monge's problem: Given two probability measures µ and ν on R^n and a cost function c : R^n × R^n → [0, ∞], the problem of Monge can be stated as follows: find T : R^n → R^n such that ν = T#µ and

$$\int_{\mathbb{R}^n} c(x, T(x))\, d\mu(x) \quad \text{is minimal.} \tag{1}$$

The condition ν = T#µ means that T transports µ onto ν; that is, ν is the push-forward of µ by T: for any test function ξ,

$$\int_{\mathbb{R}^n} \xi(y)\, d\nu(y) = \int_{\mathbb{R}^n} \xi(T(x))\, d\mu(x).$$

Monge-Kantorovich problem: Monge's problem might have no solution; hence, it is better to adopt the following generalization proposed by Leonid Kantorovich: instead of looking for a map, find a measure:

$$\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} c(x, y)\, d\pi(x, y), \tag{2}$$

where Π(µ, ν) stands for the set of all transport plans between µ and ν, i.e., the probability measures on R^n × R^n with marginals µ and ν. This problem genuinely extends Monge's problem: any transport map T sending µ onto ν yields a measure π ∈ Π(µ, ν) given by π = (Id, T)#µ, i.e., the only measure π on R^n × R^n such that

$$\int \zeta(x, y)\, d\pi(x, y) = \int \zeta(x, T(x))\, d\mu(x) \quad \text{for all test functions } \zeta,$$

and the associated costs of transportation are the same. In this version, it is not difficult to show that there is always a solution ([14] or [16]).
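The Kantorovich relaxation can be illustrated with a toy discrete computation; the four-point marginals and quadratic cost below are illustrative choices, not taken from the paper. For uniform atomic marginals, the extreme points of Π(µ, ν) are permutation matrices (Birkhoff's theorem), so a brute-force search over permutations finds the optimal plan, and for a convex cost in one dimension it is the monotone (sorted-to-sorted) matching:

```python
import itertools

# Toy check of the 1D problem with quadratic cost: for uniform atomic
# marginals, every extreme transport plan is a permutation, so brute
# force over permutations finds the optimum. Values are illustrative.
x = [0.0, 1.0, 2.5, 4.0]   # support of mu (already sorted)
y = [0.5, 1.2, 3.0, 3.5]   # support of nu (already sorted)

def cost(perm):
    # total quadratic cost of sending x[i] -> y[perm[i]]
    return sum((xi - y[j]) ** 2 for xi, j in zip(x, perm))

best = min(itertools.permutations(range(4)), key=cost)
# For a convex cost in 1D, the monotone (sorted-to-sorted) coupling is optimal.
assert best == (0, 1, 2, 3)
```

The same enumeration with a concave cost would generally select a non-monotone matching, which is why the convexity assumption recurs below.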

Dual Formulation
There is a duality between the Monge-Kantorovich problem (2) and the following problem:

$$\max \left\{ \int \psi\, d\mu + \int \varphi\, d\nu \;:\; \psi(x) + \varphi(y) \le c(x, y) \right\}. \tag{3}$$

It seems natural to look for a solution of this problem among the pairs (ψ, φ) of integrable functions on X × Y that satisfy

$$\varphi(y) = \inf_{x} \left[ c(x, y) - \psi(x) \right], \qquad \psi(x) = \inf_{y} \left[ c(x, y) - \varphi(y) \right].$$

We will write φ(y) = ψ^c(y) and ψ(x) = φ^c(x).

Definition 1.
A function ψ is said to be c-concave if ψ = φ^c for some function φ. In that case, ψ^c and φ^c are called the c-transforms of ψ and φ, respectively. We also say that (ψ, ψ^c) is an admissible pair and ψ, ψ^c are admissible potentials.
Then, the problem becomes

$$\max_{\psi\ c\text{-concave}} \left[ \int \psi\, d\mu + \int \psi^c\, d\nu \right].$$

The function ψ is called a Kantorovich potential between µ and ν.
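The defining properties of the c-transform can be checked directly in a discrete setting; the grid, quadratic cost, and starting potential below are illustrative choices:

```python
# Discrete sketch of the c-transform and c-concavity on a grid.
xs = [i * 0.1 for i in range(21)]          # grid for x and y in [0, 2]

def c(x, y):
    return 0.5 * (x - y) ** 2

def c_transform(psi):
    # psi_c(y) = min_x [ c(x, y) - psi(x) ]  (discrete version)
    return [min(c(x, y) - p for x, p in zip(xs, psi)) for y in xs]

psi = [x ** 3 - x for x in xs]             # an arbitrary starting potential
psi_c = c_transform(psi)
psi_cc = c_transform(psi_c)

# The double transform dominates the original: psi_cc >= psi pointwise,
# with equality exactly when psi is already c-concave.
assert all(a >= b - 1e-12 for a, b in zip(psi_cc, psi))
# A third transform changes nothing: (psi_cc)^c = psi^c.
assert all(abs(a - b) < 1e-12 for a, b in zip(c_transform(psi_cc), psi_c))
```

The two assertions are the discrete counterparts of ψ^{cc} ≥ ψ and ψ^{ccc} = ψ^c, which justify restricting the maximization to admissible pairs (ψ, ψ^c).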
The following proposition explains how to relate the Monge-Kantorovich problem (2) with (3); it is known as the Kantorovich duality principle. Proposition 1 (Kantorovich duality principle). Let µ and ν be Borel probability measures on X and Y ⊂ R^n, respectively. If the cost function c : R^n × R^n → [0, ∞] is lower semi-continuous and

$$\int_{\mathbb{R}^n \times \mathbb{R}^n} c(x, y)\, d\mu(x)\, d\nu(y) < \infty,$$

then there is a Borel map ψ : R^n → R that is c-concave and optimal for (3). Moreover, the resulting maximum is equal to the minimum of the Monge-Kantorovich problem (2); i.e.,

$$\min_{\pi \in \Pi(\mu,\nu)} \int_{\mathbb{R}^n \times \mathbb{R}^n} c(x, y)\, d\pi(x, y) = \max_{\psi\ c\text{-concave}} \left[ \int \psi\, d\mu + \int \psi^c\, d\nu \right],$$

where ψ(x) + ψ^c(y) ≤ c(x, y) for µ-a.e. x ∈ X and ν-a.e. y ∈ Y.

Proof.
A proof of this result can be found in [16].

Solution in the Real Line: Optimal Transportation Case
For the rest of this section, we only consider the one-dimensional case, as discussed in the Introduction.
Let X and Y be two bounded smooth open sets in R and µ(dx), ν(dy) probability measures on X and Y, respectively, with µ(dx) = f dx, ν(dy) = g dy, f = 0 in R\X, and g = 0 in R\Y.

Proposition 2.
Let h ∈ C^1(R) be a non-negative, strictly convex function, and let c(x, y) = h(x − y). Let µ and ν be Borel probability measures on R such that

$$\int_{\mathbb{R} \times \mathbb{R}} h(x - y)\, d\mu(x)\, d\nu(y) < \infty.$$

If µ has no atom and F and G stand for the cumulative distribution functions of µ and ν, respectively, then T = G^{-1} ◦ F is an optimal transport map. If π is the induced transport plan, that is, π = (Id, T)#µ, then π is optimal for the Monge-Kantorovich problem (2).

Proof. A proof of Proposition 2 can be found in [14].

In order to obtain the previous result, one has to consider the functional

$$K(\pi) = \int_{\mathbb{R} \times \mathbb{R}} c(x, y)\, d\pi(x, y),$$

where c : X × Y → R is some given cost function and Π(µ, ν) stands for the set of all transport plans between µ and ν, meaning the probability measures on R × R with marginals µ and ν; or more rigorously,

$$\Pi(\mu, \nu) = \{ \pi : \pi(A \times \mathbb{R}) = \mu(A),\ \pi(\mathbb{R} \times B) = \nu(B) \ \text{for all Borel sets } A, B \subset \mathbb{R} \}.$$

Our goal is to obtain a result similar to Proposition 2 in the case when entropy transportation is considered instead of mass transportation. This is the content of the next section.
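Proposition 2 can be illustrated in closed form; as a hypothetical example, take µ uniform on [0, 1] and ν the unit exponential, for which T = G^{-1} ◦ F is explicit:

```python
import math

# Sketch of Proposition 2 in closed form: transporting the uniform
# measure on [0, 1] to the unit exponential via T = G^{-1} o F.
# F(x) = x on [0, 1]; G(y) = 1 - e^{-y}, so G^{-1}(u) = -ln(1 - u).
def T(x):
    return -math.log(1.0 - x)

# Push-forward check through the CDF identity G(T(x)) = F(x):
for x in [0.1, 0.5, 0.9]:
    G_of_Tx = 1.0 - math.exp(-T(x))
    assert math.isclose(G_of_Tx, x)

# T is non-decreasing, as the monotone rearrangement must be.
assert T(0.2) < T(0.4) < T(0.8)
```

The identity G ◦ T = F is exactly the push-forward condition ν = T#µ written through cumulative distributions, which is how the proposition is used below.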

Solution in the Real Line: Optimal Entropy Transportation Case
As was pointed out in [1], "the fundamental measure in information theory is the entropy. It corresponds to the uncertainty associated with a signal. "Entropy" and "information" are used interchangeably, because the uncertainty about an outcome before it is observed, corresponds to the information gained by observing it" (for a review, see [1,19] or [20]). In that context, we will prove the existence of an optimal entropy transport for the cost function c(x, y) satisfying (6) and (7), similarly to the optimal transportation case discussed in the last section. We will also find the Monge-Ampère equation for quadratic cost |x − y| 2 /2 for this optimal entropy transport. A few words are in order regarding the choice of c. From the mathematical perspective, it considerably simplifies the analysis. As a matter of fact, the optimal transportation problem has not been solved for the general case of nonquadratic costs. On the other hand, from the physiological point of view, it is natural to assume that the energy required to send a signal from one point of the brain to another can be taken as a monotone function of the distance.
Let µ and ν be probability measures defined as above, take X = Y = Ω ⊂ R, and let the entropy density be characterized by Shannon's proposal:

$$s(x) = -\rho(x) \ln \rho(x), \tag{9}$$

where x ∈ Ω and ρ, ρ̃ are the distribution densities of µ and ν, respectively, in the same way as in the formulation of the optimal transportation problem; i.e., µ = ρ dx, ν = ρ̃ dy, satisfying (8). We wish (9) to be related to the probability measure µ on Ω. It is natural to think of the transport as passing from the state characterized by (9) to the one characterized by −ρ̃(y) ln(ρ̃(y)), where y ∈ Ω, which we wish to be related to the probability measure ν on Ω. Similarly, −ρ(x) ln(ρ(x)) and −ρ̃(y) ln(ρ̃(y)) will be the marginals of −ρ(x, y) ln(ρ(x, y)). As a first concrete proposal, we consider the following functional:

$$\int_{\Omega \times \Omega} c(x, y)\, \rho(x, y) \ln(\rho(x, y))\, dx\, dy,$$

where c is a spatial cost function. Notice that we have dropped the minus sign, so looking for maximal entropy is equivalent to minimizing the previous expression. The problem is then to find the optimal entropy transport strategy between x and y (analogous to Monge's problem). There is, however, a standard difficulty: in the continuous case, the entropy can be negative, whereas in the discrete case (i.e., for discrete probability functions), the entropy is always positive. We therefore consider the absolute value of the entropy density defined in (9). More precisely, it is natural to assume that

$$\int_\Omega |s(x)|\, dx = \int_\Omega |\tilde{s}(y)|\, dy = K$$

for some constant K ∈ R^+ \ {0}, where s̃(y) = −ρ̃(y) ln ρ̃(y), and to define

$$\mu(dx) = \frac{|s(x)|}{K}\, dx = \frac{|\rho(x) \ln \rho(x)|}{K}\, dx, \tag{11}$$

which is a well-defined Borel probability measure on Ω, and

$$\nu(dy) = \frac{|\tilde{s}(y)|}{K}\, dy = \frac{|\tilde{\rho}(y) \ln \tilde{\rho}(y)|}{K}\, dy, \tag{12}$$

which is also a well-defined Borel probability measure on Ω. Then, with µ and ν given by (11) and (12), we can define the set of entropy transport plans; more rigorously,

$$\Pi(\mu, \nu) = \{ \pi : \pi(A \times \Omega) = \mu(A),\ \pi(\Omega \times B) = \nu(B) \ \text{for all Borel sets } A, B \subset \Omega \}. \tag{13}$$

We can take the cumulative distributions of µ and ν, respectively, as F(x) = µ((−∞, x]) and G(y) = ν((−∞, y]), with µ and ν given as before. By definition, they are non-decreasing.
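The construction of the entropy-normalized measure (11) can be sketched numerically; the density ρ(x) = 2x on Ω = (0, 1) is an arbitrary illustrative choice:

```python
import math

# Numerical sketch of the entropy-normalized measure: start from a
# density rho on Omega = (0, 1), form s(x) = -rho ln rho, set
# K = \int |s| dx, and use |s|/K as a new probability density.
N = 10_000
dx = 1.0 / N
xs = [(i + 0.5) * dx for i in range(N)]       # midpoint rule

def rho(x):
    return 2.0 * x                             # illustrative density

s = [-rho(x) * math.log(rho(x)) for x in xs]   # Shannon entropy density
K = sum(abs(v) for v in s) * dx                # normalizing constant
mu_density = [abs(v) / K for v in s]           # density of the measure mu

assert K > 0.0
# mu is a genuine probability measure: its density integrates to 1.
assert math.isclose(sum(mu_density) * dx, 1.0, rel_tol=1e-9)
```

Note that |s|/K is non-negative by construction, which is what makes µ and ν bona fide probability measures even where the entropy density changes sign.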
Now, we are in a position to state the problem analogous to that of Monge (1), namely: find a map T : R → R such that ν = T#µ, with µ and ν given by (11) and (12), and

$$\int_\Omega c(x, T(x))\, d\mu(x) \quad \text{is minimal;}$$

and the problem analogous to the Monge-Kantorovich problem: find a measure π ∈ Π(µ, ν) given by (13), with µ and ν given by (11) and (12), such that

$$\int_{\Omega \times \Omega} c(x, y)\, d\pi(x, y) \quad \text{is minimal.} \tag{15}$$
Next, we present the proposition equivalent to Proposition 2 for the measures given by (11) and (12), namely: Proposition 3. Let h ∈ C^1(R) be a non-negative, strictly convex function. Let µ and ν be Borel probability measures on R given by (11) and (12), respectively. Suppose that

$$\int_{\Omega \times \Omega} h(x - y)\, d\mu(x)\, d\nu(y) < \infty$$

for Ω ⊂ R. If µ has no atom and F and G represent the corresponding cumulative distribution functions of µ and ν, respectively, then T = G^{-1} ◦ F is an optimal entropy transport map. If π is the induced entropy transport plan, that is, π = (Id, T)#µ, defined as T#µ(E) = µ(T^{-1}(E)) for E ⊂ Ω, then π is optimal for Problem (15).
Proof. The proof of this result is adapted from [14] and for the sake of completeness is given in Appendix A.
Our next goal is to deduce the Monge-Ampère equation for this case. In order to do that, we will need an analog of the Kantorovich duality principle (Proposition 1) for the measures given by (11) and (12), namely: Proposition 4. Let µ and ν be the Borel probability measures on Ω ⊂ R given by (11) and (12), respectively. If the cost function c : Ω × Ω → [0, ∞] is lower semi-continuous and

$$\int_\Omega \int_\Omega c(x, y)\, d\mu(x)\, d\nu(y) < \infty,$$

then there is a Borel map ψ : R → R that is c-concave and optimal for (3). Moreover, the resulting maximum is equal to the minimum of Problem (15); i.e.,

$$\min_{\pi \in \Pi(\mu,\nu)} \int_{\Omega \times \Omega} c(x, y)\, d\pi(x, y) = \max_{\psi\ c\text{-concave}} \left[ \int_\Omega \psi\, d\mu + \int_\Omega \psi^c\, d\nu \right],$$

and if π ∈ Π(µ, ν) given by (13) is optimal, then ψ(x) + ψ^c(y) = c(x, y) almost everywhere for π.
Proof. A proof of this result can be found in Appendix A.
If we propose c(x, y) = |x − y|²/2, we wish T to be expressed as T = ∇φ for some convex function φ and then to be able to find the corresponding Monge-Ampère equation related to the measures µ and ν given by (11) and (12). This fact is guaranteed by Brenier's theorem. The details of its proof, adapted to our case, are important and can be found in Appendix A.
Theorem 1 (Brenier). Let µ and ν be the Borel probability measures on Ω ⊂ R given by (11) and (12), respectively, and with finite second-order moments; that is, such that

$$\int_\Omega |x|^2\, d\mu(x) < \infty \quad \text{and} \quad \int_\Omega |y|^2\, d\nu(y) < \infty.$$

Then, if µ is absolutely continuous on Ω, there exists a unique T : R → R such that ν = T#µ and

$$\int_\Omega |x - T(x)|^2\, d\mu(x) = \min_{\pi \in \Pi(\mu,\nu)} \int_{\Omega \times \Omega} |x - y|^2\, d\pi(x, y),$$

with Π(µ, ν) given by (13). Moreover, there is only one optimal transport plan, γ, which is necessarily (Id, T)#µ, and T is the gradient of a convex function ϕ, which is therefore unique up to an additive constant. There is also a unique (up to an additive constant) Kantorovich potential, ψ, which is locally Lipschitz and linked to ϕ through the relation

$$\varphi(x) = \frac{|x|^2}{2} - \psi(x).$$

Proof. See Appendix A.

Observation 2.
Observe that Theorem 1 holds for the general case on R n .

Let

$$\mu = \rho\, dx, \qquad \nu = \tilde{\rho}\, dy$$

be two probability measures, absolutely continuous with respect to the Lebesgue measure. By Theorem 1, there exists a unique gradient of a convex function, ∇ϕ, such that ν = ∇ϕ#µ; that is,

$$\int \zeta(y)\, \tilde{\rho}(y)\, dy = \int \zeta(\nabla\varphi(x))\, \rho(x)\, dx \tag{20}$$

for all test functions ζ ∈ C_b(R). Since ϕ is strictly convex, ∇ϕ is C^0 and one-to-one. Hence, taking y = ∇ϕ(x), we get

$$\int \zeta(\nabla\varphi(x))\, \tilde{\rho}(\nabla\varphi(x))\, \det D^2\varphi(x)\, dx = \int \zeta(\nabla\varphi(x))\, \rho(x)\, dx. \tag{21}$$

From (20) and (21), we get

$$\tilde{\rho}(\nabla\varphi(x))\, \det D^2\varphi(x) = \rho(x), \tag{22}$$

and the Monge-Ampère equation

$$\det D^2\varphi(x) = \frac{\rho(x)}{\tilde{\rho}(\nabla\varphi(x))} \tag{23}$$

corresponding to this case.
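The one-dimensional Monge-Ampère Equation (23) can be verified in closed form for a pair of Gaussian densities (an illustrative choice; here f and g play the roles of ρ and ρ̃, and D²ϕ reduces to ϕ''):

```python
import math

# Closed-form sanity check of the 1D Monge-Ampere equation
# phi''(x) * g(phi'(x)) = f(x) for quadratic cost, with f the standard
# normal density and g the N(m, sigma^2) density (illustrative choice).
m, sigma = 1.5, 0.7

def f(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def g(y):
    return math.exp(-((y - m) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# Brenier potential phi(x) = m x + sigma x^2 / 2, so T = phi' = m + sigma x
def phi_prime(x):
    return m + sigma * x

phi_second = sigma  # phi'' is constant for this affine map

for x in [-2.0, -0.5, 0.0, 1.0, 2.3]:
    assert math.isclose(phi_second * g(phi_prime(x)), f(x), rel_tol=1e-12)
```

The affine map T(x) = m + σx is the classical optimal map between Gaussians, so the equation holds exactly rather than only approximately.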

Neural Branching Structure and the Linearization of the Monge-Ampère Equation
As was pointed out in the Introduction, the purpose of this section is to propose a model for the branching structure of the neurons, which is consistent with the process of information transport previously introduced. The basic idea is as follows. If we consider that information transport is optimized in some brain processes, we consequently have (as discussed in the previous section) an associated Monge-Ampère equation for the transport plan potential. Besides, it is natural to consider a cost that is close to some power of the distance function, since physiological cost can be taken to depend on the distance traveled by the corresponding signal, as is usually assumed in transport networks. For technical reasons, and in order to be able to adapt results well known in the literature, we take this cost function to be close to $|x - y|^2/2$. From a qualitative perspective, this choice should not change the results much, as long as the cost function remains convex (see [16]). If this is the case, we can then compute the linearization of the Monge-Ampère equation around this quadratic cost function. The resulting equation is a linear elliptic equation. We argue that a self-activating mechanism should be incorporated in the form of a nonlinear term (as a result of the excitable nature of the transport of electric impulses along axons).
In this way, we end up with a semilinear equation that can be used to explain branching processes in biological networks (see [21]). More specifically, the solution to this equation could be associated with the concentration of a morphogen, e.g., a growth factor, and if such a concentration is above a certain threshold, a branching mechanism is triggered. It is then consistent to look for solutions that are close to the quadratic cost function and therefore to linearize around it.
In order to linearize the Monge-Ampère equation, we assume that ϕ is very close to |x|²/2, so that ρ(x) ln[ρ(x)] is very close to ρ(∇ϕ(x)) ln[ρ(∇ϕ(x))]. In that case, following [16,22,23], we write

$$\varphi(x) = \frac{|x|^2}{2} + \varepsilon\, \eta(x) \tag{24}$$

and

$$\tilde{\rho} = \rho\, (1 + \varepsilon h), \tag{25}$$

with η, h ∈ L¹(µ) and ε ≪ 1. We leave the details of this computation for Appendix A. Substituting (24) and (25) into the Monge-Ampère Equation (23), we get the linearized operator

$$\mathcal{L}\eta = \Delta\eta + \nabla(\ln\rho)\cdot\nabla\eta, \tag{26}$$

with

$$\mathcal{L}\eta = -h.$$

Then, the Laplacian plus a transport term can be seen as the linearized version of the Monge-Ampère equation for our proposal. We notice that the main mechanism responsible for the flow of information along the axons is the propagation of electrical impulses. This is well known to be an excitable process that involves, among other features, a self-activating component, as for instance in the standard Hodgkin-Huxley or FitzHugh-Nagumo models. If we include this in the linearization previously obtained (Equation (26)), we get

$$\Delta\phi + \nabla(\ln\rho)\cdot\nabla\phi + F_a(\phi) = 0,$$

where F_a is a function describing the self-activating mechanism and can typically be taken as a power of φ,

$$F_a(\phi) = \phi^p,$$

with p > 1. Solutions of this type of equation have been studied by many authors since the pioneering work of Ni and Takagi ([24] and the references therein), since they appear in different contexts. These solutions typically exhibit concentration phenomena that can be responsible for branching structures. Indeed, if one assumes that the concentration of a solution to the previous equation is correlated with a growth factor morphogen, then a branch will stem out of the main branch. This or similar models have been proposed using reaction-diffusion equations following Turing's original proposal ([25] or [26]) for pattern formation, in particular for branching structures in plants ([27,28] or [29]), lungs [21], and other vascular systems ([30,31]). Figures 1 and 2 show numerical simulations for a particular case of the linearization of the Monge-Ampère equation.
Growth is induced by the concentration of the solution, and it can be seen that the process gives rise to lateral branches. Branching occurs when the concentration of the solution, the morphogen, rises above a certain threshold (the color code stands for standard heat maps: red, high; blue, low). This simulation was provided by Jorge Castillo-Medina and developed in COMSOL. For more details, the reader is referred to [21] and the references therein.
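A minimal one-dimensional caricature of this concentration mechanism can be sketched with an explicit finite-difference scheme for u_t = u_xx + u², where the quadratic term stands in for the self-activating nonlinearity; this is a toy sketch with illustrative parameters, not the COMSOL simulation of Figures 1 and 2:

```python
import math

# Explicit Euler for u_t = u_xx + u^2 on [0, 10] with homogeneous
# Dirichlet boundary conditions. All parameters are illustrative;
# dt/dx^2 = 0.2 keeps the diffusion step stable.
N, dx, dt, steps = 101, 0.1, 0.002, 100
xs = [i * dx for i in range(N)]
# localized initial bump, large enough for the reaction term to dominate
u = [3.0 * math.exp(-((x - 5.0) ** 2)) for x in xs]
u0_max = max(u)

for _ in range(steps):
    lap = [0.0] * N
    for i in range(1, N - 1):
        lap[i] = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
    u = [ui + dt * (li + ui * ui) for ui, li in zip(u, lap)]
    u[0] = u[-1] = 0.0  # homogeneous Dirichlet boundary

# the bump self-amplifies: the peak grows and stays localized
assert max(u) > u0_max
assert u.index(max(u)) == N // 2
```

With a small initial amplitude, diffusion wins and the bump flattens instead; the above-threshold behavior is what the morphogen-triggered branching picture relies on.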

Murray's Law and Neural Branching
In the previous section, we argued that neuronal branching is compatible with reaction-diffusion processes. On the other hand, we deduced the corresponding equations by considering the transport of information along neural networks. The question naturally arises of whether there is some connection with Murray's law. Recall that Murray's law refers to a transport network ([32,33]).
In his original paper [32], Murray obtained from optimizing considerations a relationship for the different parameters associated with a branching transport network, which was later generalized in [15]. In what follows, we deduce it from scratch for the sake of completeness (we refer the reader to [15,34,35] for further details). The total power required for the flow to overcome the viscous drag is described by

$$P = \frac{8\mu L}{\pi r^4}\, f^2 + m\, \pi r^2 L, \tag{27}$$

where µ is the dynamic viscosity of the fluid, L is the vessel length, f is the volumetric flow rate, m is an all-encompassing metabolic coefficient that includes the chemical cost of keeping the blood constituents fresh and functional and the general cost owing to the weight of the blood and the vessel, and r is the vessel radius. For our purposes, we modify Equation (27) as follows:

$$P = a\, \frac{f^2}{r^2} + b\, r^\alpha,$$

where the first term corresponds to the power required for an electrical impulse to propagate along the axon. Notice in particular that the factor of r² in the denominator follows from the fact that electrical resistance is inversely proportional to the cross-sectional area of the conducting material.
On the other hand, the second term is proportional to a power, α, of the radius and describes the fact that metabolic cost can vary depending on the type of neurons with which we are dealing. For instance, the degree of myelination of the axon could determine the effective cost associated with information transport. The minimum power is found by differentiating with respect to r and equating to zero:

$$\frac{dP}{dr} = -\frac{2a f^2}{r^3} + \alpha b\, r^{\alpha - 1} = 0.$$

With this, the optimal radius is

$$r = \left( \frac{2a f^2}{\alpha b} \right)^{1/(\alpha + 2)},$$

and the optimal relation between volumetric flow rate and vessel radius, such that the power requirement is minimized, is obtained as

$$f^2 = k\, r^{\alpha + 2}, \qquad \text{i.e.,} \quad f = \sqrt{k}\, r^{(\alpha + 2)/2},$$

where k = αb/(2a). Using the construction of [33], if r_0 is the radius of the main branch, r_1 and r_2 are the radii of the lateral branches, and x and y are the angles between the lateral branches and the main branch, conservation of the flow gives a generalized version of Murray's law:

$$r_0^{(\alpha + 2)/2} = r_1^{(\alpha + 2)/2} + r_2^{(\alpha + 2)/2}$$

for α ∈ R. Using these relations, we obtain three general equations associated with the branching angles (see Figure 3) x, y, and x + y:

$$\cos x = \frac{r_0^4 + r_1^4 - r_2^4}{2 r_0^2 r_1^2}, \qquad \cos y = \frac{r_0^4 + r_2^4 - r_1^4}{2 r_0^2 r_2^2}, \qquad \cos(x + y) = \frac{r_0^4 - r_1^4 - r_2^4}{2 r_1^2 r_2^2},$$

which correspond to a different generalized Murray's law for different values of α ∈ R. If α = 4, we recover r_0³ = r_1³ + r_2³, which corresponds to Murray's original proposal [33] and states that the angle in the bifurcation of an artery should not be less than 75° (74.9°, to be more exact). This is consistent with the numerical and experimental results in [36]. If α = 6, we get cos(x + y) = 0 and then x + y = π/2. On the other hand, cos(x) = r_1²/r_0², so cos(x) > 0 since r_0, r_1 ≠ 0; this implies that cos(x) ≠ 0 and x ≠ π/2. Similarly, y ≠ π/2. We obtain that x + y = π/2 and x, y ∈ (0, π/2) for this case. In other words, for α = 6, the total angle between the bifurcated branches is π/2, but orthogonal branching of a single daughter branch relative to the parent is ruled out.
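The branching angles quoted above can be computed directly; the sketch below assumes symmetric daughter branches (r_1 = r_2) and combines the generalized flow-conservation law with Murray's angle relations:

```python
import math

# Branching angles for the generalized Murray law. Flow conservation
# uses the exponent n = (alpha + 2) / 2, while the angle relation keeps
# the fourth powers of the radii; symmetric daughters are assumed.
def total_bifurcation_angle(alpha):
    n = (alpha + 2.0) / 2.0
    r0 = 1.0
    r1 = r2 = r0 * 2.0 ** (-1.0 / n)        # from r0^n = r1^n + r2^n
    cos_xy = (r0 ** 4 - r1 ** 4 - r2 ** 4) / (2.0 * r1 ** 2 * r2 ** 2)
    return math.degrees(math.acos(cos_xy))

# alpha = 4 recovers Murray's classical result of about 74.9 degrees;
# alpha = 6 gives exactly 90 degrees, as in the text.
assert round(total_bifurcation_angle(4), 1) == 74.9
assert abs(total_bifurcation_angle(6) - 90.0) < 1e-9
```

Scanning α between these two values traces the family of optimal total angles discussed next.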
We conclude then that the relevant values for our purposes are for α ∈ [2,6]. It would be interesting to contrast these possible scenarios with experimental data for different kinds of nervous tissues. To our knowledge, no systematic experimental study of branching angles has been carried out.

Conclusions
We proposed that information flow in some brain processes can be analyzed in the framework of optimal transportation theory. A Monge-Ampère equation was obtained for the optimal transportation plan potential in the one-dimensional case. Extrapolating to higher dimensions, the corresponding linearization around a quadratic distance cost was derived and shown to be consistent with the branching structure of the nervous system. Finally, a generalized version of Murray's law was derived assuming different cost functions, depending on a parameter related to the metabolic maintenance term. Future work includes a detailed comparison of the methodological proposal with experimental data. In particular, it would be interesting to carry out the program proposed here in a concrete cognitive experiment. Possible concrete experiments to compare with can be found in [2,3,5]. Here, we outline a simple procedure with fMRI data in which a direct connection with optimal transport theory can be tested. Consider the brain activity map for the resting state given by standard fMRI. Once normalized, this map will provide the initial probability density and entropy to be transported. In fact, in [37], another possible methodology for measuring the entropy with fMRI can be found. Later on, the subject is asked to perform a simple motor task, for instance moving the right hand. The corresponding density after the task is done can then be registered as before and will provide the final density in the optimal transport problem. Some intermediate densities should be determined as well. This information will provide a transport plan that can be compared with the mathematical solution of the problem. Correspondingly, the branching structures and their bifurcation angles and radii should be compared with experimental results as well. In principle, Murray's law should be consistent with the Monge-Ampère equation and its linearization, and it should be possible to derive it from them.
A precise relationship between the maximum entropy principle and optimal entropy transport should be clarified.
Author Contributions: All authors contributed equally to the manuscript: conceptualization, methodology, software, validation, formal analysis, investigation, resources, writing (original draft preparation), writing (review and editing), visualization, supervision, and project administration. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Acknowledgments: M.A.P. was supported by Universidad Autónoma de la Ciudad de México, sabbatical approval UACM/CAS/010/19. This work was supported by the Departamento de Matemáticas y Mecánica (MyM) of the Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS) of the Universidad Nacional Autónoma de México (UNAM). M.A.P. would like to thank P.P. and IIMAS-UNAM for the support during his sabbatical leave. The authors would also like to thank Jorge Castillo-Medina at the Universidad Autónoma de Guerrero in Acapulco for his kind permission to use Figures 1 and 2.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Relevant Theorems and Some Proofs
Theorem A1 (The continuity and support theorem). Let µ be a probability measure on R with cumulative distribution function F. The following properties are equivalent: a. The function F is strictly increasing in the interval: The inverse measure µ −1 is non-atomic. d. The support of µ, given by: with Σ a σ-algebra defined on R, is a closed interval in the real line, finite or not.
Proof. The proof of this result can be found in [38], Appendix A.
Proposition A1. Let h ∈ C^1(R) be a non-negative, strictly convex function. Let µ and ν be Borel probability measures on R given by (11) and (12), respectively. Suppose that

$$\int_{\Omega \times \Omega} h(x - y)\, d\mu(x)\, d\nu(y) < \infty$$

for Ω ⊂ R. If µ has no atom and F and G represent the cumulative distribution functions of µ and ν, respectively, then T = G^{-1} ◦ F is an optimal entropy transport map. If π is the induced entropy transport plan, that is, π = (Id, T)#µ, defined as T#µ(E) = µ(T^{-1}(E)) for E ⊂ Ω, then π is optimal for Problem (15).

Proof. 1. T is well defined:
The only problem we might have with the definition of T = G^{-1} ◦ F could arise when F(x) = 0. However, if for some a ∈ R we have F(a) = 0, then µ((−∞, a]) = 0, which means that a = −∞, and T is well defined, as desired. 2. ν = T#µ: Let F and G be defined as in Observation 1. Then, T = G^{-1} ◦ F is non-decreasing, since F and G are non-decreasing. Then, T#µ((−∞, y]) = µ(T^{-1}((−∞, y])). Since T is non-decreasing, T^{-1}((−∞, y]) is an interval. Claim 1. Since µ has no atom, F is increasing and continuous, and then, T^{-1}((−∞, y]) is a closed interval.
Proof. The existence of a maximizing pair and Relation (A8) have been proven in Proposition A1. Now, choosing an optimal π ∈ Π(µ, ν), we have: The proof is completed.
We have proven the theorem.
We have proven that the Laplacian plus a transport term can be seen as the linearized version of the Monge-Ampère equation for our proposal.