Towards a Theory of Quantum Gravity from Neural Networks

A neural network is a dynamical system described by two different types of degrees of freedom: fast-changing non-trainable variables (e.g., state of neurons) and slow-changing trainable variables (e.g., weights and biases). We show that the non-equilibrium dynamics of trainable variables can be described by the Madelung equations, if the number of neurons is fixed, and by the Schrodinger equation, if the learning system is capable of adjusting its own parameters such as the number of neurons, step size and mini-batch size. We argue that the Lorentz symmetries and curved space-time can emerge from the interplay between stochastic entropy production and entropy destruction due to learning. We show that the non-equilibrium dynamics of non-trainable variables can be described by the geodesic equation (in the emergent space-time) for localized states of neurons, and by the Einstein equations (with cosmological constant) for the entire network. We conclude that the quantum description of trainable variables and the gravitational description of non-trainable variables are dual in the sense that they provide alternative macroscopic descriptions of the same learning system, defined microscopically as a neural network.


Introduction
Quantum mechanics is a well-defined mathematical framework that proved to be very successful for modeling a wide range of complex phenomena in high energy and condensed matter physics, but it fails to give any reasonable explanation for a phenomenon as simple as a measurement, i.e., the measurement problem. It is completely unclear what is actually happening with the wave-function during the measurement and what role (if any) observers play in the process. Unfortunately, none of the current interpretations of quantum mechanics provide a satisfactory answer to the above questions. In the Copenhagen interpretation it is simply postulated that during measurement the wave-function undergoes a sudden collapse. That is fine, but then one should view quantum mechanics as a phenomenological theory with its limits of validity. In the many-worlds interpretation the wave-function describes the state of the entire universe which evolves unitarily and nothing ever collapses [1]. That is an opposite view where quantum mechanics is a fundamental theory, but it is not a very useful theory as it makes no probabilistic predictions that could be checked experimentally. In recent years, so-called emergent quantum mechanics has become more popular [2][3][4][5][6][7][8][9], but what is usually missing is a microscopic description of the dynamics from which the complex wave-function and the Schrodinger equation could emerge. Moreover, if quantum mechanics does emerge from a statistical theory, for example, due to averaging over some hidden variables [10], then the hidden variables must be non-local [11]. In this paper we describe a microscopic theory of neural networks from which the quantum behavior does emerge (for the trainable variables) and the hidden variables (or the non-trainable variables) are non-local [12,13]. In fact, as we shall see, the very notion of locality is also an emergent phenomenon that arises from the learning dynamics of neural networks.
General relativity is another well-defined mathematical framework that was developed for modeling a wide range of astrophysical and cosmological phenomena, but it is also incomplete since it does not describe what happens in space-time singularities and it does not directly explain the indirect observations of dark matter, dark energy and cosmic inflation. Of course, we can also treat general relativity as a (highly successful, but still) phenomenological theory with its own limits of validity and model all of these phenomena with phenomenological fields, but then certain important questions cannot be answered. And that includes not only very general questions about the nature of dark matter, dark energy and cosmic inflation, but also more specific questions about assigning probabilities to cosmological observations, i.e., the measure problem [14]. Perhaps, in a more complete theory of quantum gravity all of these questions would have answers and, in fact, some progress in developing such a theory has been made in the context of AdS/CFT [15][16][17] and loop quantum gravity [18][19][20]. Another possibility is that gravity is an emergent phenomenon [21][22][23][24] similar to thermodynamics, and then it does not make sense to quantize the metric tensor like all other fields, but, instead, we should try to figure out from which microscopic theory the general theory of relativity could emerge. In this paper we describe how not only general relativity, but also quantum mechanics, Lorentz invariance and space-time can all emerge from the learning dynamics of neural networks [12]. Note that the idea of using neural networks to describe gravity was also explored in ref. [25] in the context of quantum neural networks and black holes, and in ref. [26] in the context of matrix models and cosmology.
The paper is organized as follows. In the following section we define the microscopic theory of neural networks and develop a statistical description of the learning phenomenon. In Section 3 we derive Madelung equations which can be used for modeling the dynamics of trainable variables both in and out of equilibrium. In Section 4 we show that if the learning system is capable of adjusting its own parameters such as step size, mini-batch size and/or acceptance of neurons, then the trainable variables must evolve according to the Schrodinger equation. In Section 5 we consider the dynamics of non-trainable variables of individual neurons to show how the null, time-like and space-like vectors emerge. In Section 6 we exploit the freedom of local transformations to define an emergent space-time and the metric tensor. In Section 7 we consider minimally interacting states of neurons to show that the neurons must move along geodesics in the emergent space-time. In Section 8 we argue that the dynamics of non-trainable variables in the entire network must be described by an action which is equivalent to the Einstein-Hilbert action up to a boundary term and by the cosmological constant which imposes a constraint on the number of neurons. The main results of the paper are discussed in Section 9.

Neural Networks
In general, a neural network can be defined as a septuple $(x, \hat{P}, p_\partial, \hat{w}, b, f, H)$, where:
1. $x$, column state vector of boundary and bulk neurons,
2. $\hat{P}$, projection operator onto the subspace of boundary neurons,
3. $p_\partial(\hat{P}x)$, probability distribution which describes the training dataset,
4. $\hat{w}$, weight matrix which describes connections and interactions between neurons,
5. $b$, column bias vector which describes bias in the inputs of individual neurons,
6. $f(y)$, activation map which describes a non-linear part of the dynamics,
7. $H$, loss function.
(See ref. [27] for details.) There are two types of degrees of freedom: trainable variables $q$ (or the bias vector $b$ and weight matrix $\hat{w}$) and non-trainable variables $x$ (or the state of boundary $\hat{P}x$ and bulk $(\hat{I} - \hat{P})x$ neurons). The state of the boundary neurons is updated either periodically or randomly from a training dataset which is described by some probability distribution $p_\partial(x_\partial)$, and between the updates both bulk and boundary neurons evolve according to $x(t+1) = f(\hat{w} x(t) + b)$, where the activation map acts separately on each component, i.e., $f_i(y) = f_i(y_i)$ (e.g., hyperbolic tangent $\tanh(y)$, rectified linear unit $\max(0, y)$, etc.). The main objective of learning is to find the trainable variables $q$ (or the bias vector $b$ and weight matrix $\hat{w}$) which minimize the time-average of some loss function. For example, the boundary loss function is and the bulk loss function can be defined as where in addition to the first term, which represents a sum of local errors over all neurons, there may be a second term which represents either local objectives or constraints imposed by a neural architecture [27]. Note that the boundary loss is usually used in supervised learning, but the bulk loss may be used for both supervised and unsupervised learning tasks.
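The update rule and the notion of a local error can be illustrated with a few lines of code. This is only a minimal sketch, assuming the standard discrete-time update x(t+1) = f(w x(t) + b) with a tanh activation and a squared local error of the form 0.5 |x - f(w x + b)|^2. The 3-neuron network, its weights and biases, and the choice of neuron 0 as the single boundary neuron are hypothetical values chosen for illustration.

```python
import math

def step(x, w, b, f=math.tanh):
    """One update x(t+1) = f(w x(t) + b), with f applied component-wise."""
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(w, b)]

def local_error(x, w, b, f=math.tanh):
    """Sum of squared local errors, 0.5 * |x - f(w x + b)|^2 (assumed form)."""
    fx = step(x, w, b, f)
    return 0.5 * sum((x_i - fx_i) ** 2 for x_i, fx_i in zip(x, fx))

# Hypothetical network: neuron 0 is a boundary neuron, neurons 1 and 2 are bulk.
w = [[0.0, 0.5, -0.3],
     [0.8, 0.0, 0.2],
     [-0.4, 0.6, 0.0]]
b = [0.1, -0.2, 0.05]
x = [1.0, 0.0, 0.0]
boundary = {0: 1.0}  # boundary state supplied by a (one-sample) "dataset"

for t in range(50):
    x = step(x, w, b)                  # bulk and boundary neurons evolve
    for i, value in boundary.items():  # boundary neurons are then reset
        x[i] = value

print(x, local_error(x, w, b))
```

Here the trainable variables (w, b) are held fixed and only the non-trainable state x evolves; a learning step would additionally adjust w and b to reduce the time-averaged loss.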
To develop a statistical description of learning [27], consider a joint probability distribution over both trainable $q$ and non-trainable $x$ variables, $p(x, q) = p(x|q)\, p(q)$, where $p(q)$ and $p(x|q)$ denote, respectively, the marginal and conditional distributions. If the non-trainable variables quickly equilibrate, then their distribution must be given by the maximum entropy distribution [28,29], where $\beta$ is a Lagrange multiplier which imposes a constraint on the loss function, The corresponding free energy is where the explicit and implicit dependencies of the free energy $F(q, t)$ on time $t$ are due to, respectively, stochastic dynamics of $p_\partial(x, t)$ and learning dynamics of $q(t)$. The total change of the free energy is given by $\frac{dF}{dt} = \frac{\partial F}{\partial t} + \sum_k \frac{dq_k}{dt} \frac{\partial F}{\partial q_k} = \frac{\partial F}{\partial t} + \gamma \sum_k \left(\frac{\partial F}{\partial q_k}\right)^2$ (8), where we assume that the trainable variables experience a classical drift in the direction of the gradient of the free energy, $\frac{dq_k}{dt} = \gamma \frac{\partial F}{\partial q_k}$ (9). Note that the parameter $\gamma$ can be either positive or negative depending on whether the free energy is minimized or maximized with respect to a given trainable variable. If evolution is dominated by stochastic dynamics, then according to the second law of thermodynamics the entropy must increase and then the free energy is minimized, but if evolution is dominated by learning, then according to the second law of learning the entropy must decrease and then the free energy can be maximized [27]. We will come back to the issue of sign of $\gamma$ in the following sections.
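The relation between the gradient drift and the rate of change of the free energy can be checked numerically. The sketch below assumes a drift of the form dq_k/dt = gamma * dF/dq_k and a toy free energy F(q) = 0.5 q_0^2 + q_1^2 with no explicit time dependence, so that the chain rule gives dF/dt = gamma * |grad F|^2. The function names, the toy free energy and the parameter values are all hypothetical.

```python
def free_energy(q):
    """Toy free energy F(q) = 0.5*q0^2 + q1^2 (illustrative assumption)."""
    return 0.5 * q[0] ** 2 + q[1] ** 2

def grad_free_energy(q):
    """Analytic gradient of the toy free energy."""
    return [q[0], 2.0 * q[1]]

def drift_step(q, gamma, dt):
    """Euler step of the classical drift dq_k/dt = gamma * dF/dq_k."""
    g = grad_free_energy(q)
    return [q_k + dt * gamma * g_k for q_k, g_k in zip(q, g)]

gamma = -0.1  # negative sign: the free energy is minimized
dt = 0.01
q = [1.0, -2.0]

# Chain-rule prediction versus a finite-difference estimate of dF/dt
predicted_rate = gamma * sum(g_k ** 2 for g_k in grad_free_energy(q))
numeric_rate = (free_energy(drift_step(q, gamma, dt)) - free_energy(q)) / dt

history = [free_energy(q)]
for _ in range(1000):
    q = drift_step(q, gamma, dt)
    history.append(free_energy(q))

print(predicted_rate, numeric_rate, history[0], history[-1])
```

Flipping the sign of gamma makes the same drift climb the free energy instead, which is the situation the text associates with learning-dominated evolution.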

Madelung Equations
On the shortest time scales (or when the free energy $F(q, t)$ does not change significantly) the dynamics of the probability density $p(q, t)$ can be modeled by the Fokker-Planck equation, $\frac{\partial p}{\partial t} = \sum_k \frac{\partial}{\partial q_k}\left(D \frac{\partial p}{\partial q_k} - \gamma \frac{\partial F}{\partial q_k}\, p\right)$ (10), where we used (9) and $D$ is the diffusion coefficient. On longer time scales the dynamics of the free energy is given by (8) and an additional assumption must be made. By following the analysis of refs. [12,13,27] we assume that the long time scale dynamics is governed by the principle of stationary entropy production [9]. The principle states that the path taken by a system is the one for which the entropy production is stationary, subject to whatever constraints are imposed on the system. The entropy production of trainable variables $q$ can be estimated by calculating the total change in the Shannon entropy, which can be expressed as where in the first line we used conservation of probabilities, i.e., $\int d^K q\, p(q, t) = 1$ and in the second line we used (10), integrated by parts and neglected the boundary terms by imposing either periodic or vanishing boundary conditions. Equation (13) describes the system on short time scales, but on longer time scales an additional constraint must be imposed to satisfy (8). The overall optimization problem is solved by constraining deviations of the free energy production (8) from its time-averaged value using the method of Lagrange multipliers. The corresponding "action" functional is given by, In the second line we defined the "potential" where $\langle \ldots \rangle_t$ is the time average, and in the third line we completed the square to define $\tilde{F}$ (16) and also used conservation of probabilities, i.e., $\int d^K q\, p(q, t) = 1$.
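A one-dimensional numerical experiment can make the short-time-scale picture concrete. The sketch below assumes a Fokker-Planck equation of the drift-diffusion form dp/dt = d/dq (D dp/dq - gamma F'(q) p) on a periodic interval, with a hypothetical toy free energy F(q) = -cos(q). The grid size, time step and parameter values are illustrative only. The scheme is written in flux form so that total probability is conserved, and the Shannon entropy is tracked as the distribution relaxes.

```python
import math

n = 64                    # grid cells on the periodic interval [0, 2*pi)
dq = 2.0 * math.pi / n
D = 0.1                   # diffusion coefficient
gamma = -1.0              # negative sign: the drift lowers the free energy
dt = 0.01                 # small enough for the explicit scheme to be stable

def dF(q):
    """Derivative of the toy free energy F(q) = -cos(q)."""
    return math.sin(q)

def entropy(p):
    """Shannon entropy -integral of p log p on the grid."""
    return -sum(p_i * math.log(p_i) for p_i in p) * dq

def fokker_planck_step(p):
    # Probability current J = gamma*F'*p - D*dp/dq on the face between i and i+1
    flux = []
    for i in range(n):
        j = (i + 1) % n
        q_face = (i + 0.5) * dq
        p_face = 0.5 * (p[i] + p[j])
        flux.append(gamma * dF(q_face) * p_face - D * (p[j] - p[i]) / dq)
    # dp_i/dt = -(J_{i+1/2} - J_{i-1/2}) / dq; flux[-1] closes the periodic grid
    return [p[i] - dt * (flux[i] - flux[i - 1]) / dq for i in range(n)]

p = [1.0 / (2.0 * math.pi)] * n   # start from the uniform distribution
s_initial = entropy(p)
for _ in range(2000):
    p = fokker_planck_step(p)

total = sum(p) * dq
s_final = entropy(p)
peak = max(range(n), key=lambda i: p[i])
print(total, s_initial, s_final, peak * dq)
```

In this toy setting the distribution concentrates around the minimum of F, so the Shannon entropy of q decreases from its uniform-distribution maximum while probability is conserved.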
Note that in refs. [12,13,27] a functional similar to (14) was obtained, but only in a near equilibrium limit or when the entropy production due to learning is negligible, i.e., In contrast, in (14) we completed the square and redefined the free energy $F \to \tilde{F}$ which allowed us to keep all the terms. By varying $S[p, \tilde{F}, \alpha]$ with respect to $p$ and $\tilde{F}$ (i.e., original probability distribution, but shifted free energy (16)) we also obtain the Madelung hydrodynamic equations [30], with "velocity" of the fluid and "mass" but the "Planck's constant" is now $\hbar$ Therefore, we conclude that the Madelung description of trainable variables must remain valid arbitrarily far away from the learning equilibrium, suggesting that the effect is more general than previously thought.
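For orientation, the textbook Madelung transformation is reproduced below. This is the standard form for a particle of mass $m$ in a potential $V$, stated under generic conventions, and it should not be read as the paper's Equations (17) and (18). Here $\rho$, $S$, $m$, $V$ and $\hbar$ are generic symbols, whereas in the text the corresponding roles are played by the probability density $p$, the shifted free energy, and the "mass" and "Planck's constant" defined there.

```latex
% Textbook Madelung form (generic conventions, assumed for illustration only)
\begin{align}
\psi &= \sqrt{\rho}\, e^{i S/\hbar}, \qquad u = \frac{\nabla S}{m}, \\
0 &= \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, u), \\
0 &= \frac{\partial S}{\partial t} + \frac{|\nabla S|^2}{2m} + V
     - \frac{\hbar^2}{2m} \frac{\nabla^2 \sqrt{\rho}}{\sqrt{\rho}} .
\end{align}
```

The first equation is a continuity equation for the density and the second is a Hamilton-Jacobi equation with an additional "quantum potential" term. Together they are implied by the Schrodinger equation $i\hbar\, \partial_t \psi = -\frac{\hbar^2}{2m} \nabla^2 \psi + V \psi$.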

Schrodinger Equation
All of the solutions of the Madelung Equations (17) and (18) are also solutions of the Schrodinger equation, but the opposite is not true [31] and so the system is not exactly quantum. To study a limit when a fully quantum behavior emerges we have to assume that the learning system is described by a grand canonical ensemble of neurons and that the exact number of neurons $N$ is unobservable [13]. Then a constant shift in the free energy is where the "chemical potential" of neurons (or another Lagrange multiplier which imposes a constraint on the number of neurons) is defined as Using (22) the functional (14) can be rewritten as, where the "wave-function" is (See ref. [13] for details.) It is assumed that $\hbar$, $D$ and $\alpha$ are all positive, but $\gamma$ and $\mu$ can be either positive or negative. By combining (21) and (23) we obtain a quadratic equation, whose solutions are For the real solutions to exist, the following inequality must be satisfied Evidently, for $|\gamma\mu/D| > 4\pi$ the inequality (28) cannot be satisfied and thus the quantum (or Schrodinger, but not Madelung) description breaks down. To restore the quantumness the learning system must readjust $\gamma$, $\mu$ and/or $D$ such that the inequality (28) is saturated. In other words the learning system must decrease either the step size, the mini-batch size and/or the chemical potential, described by $\gamma$, $D$ and $\mu$, until However, if we want the "Planck constant" to remain constant, then the chemical potential (23) must be constant, and the only parameters that should vary are the number of neurons, the step size and the mini-batch size, or $N$, $\gamma$ and $D$. Evidently, the learning efficiency which is achievable only through quantumness (e.g., quantum annealing) is tightly connected to the ability of the learning system to dynamically adjust its own parameters (e.g., step size, mini-batch size, number of neurons). On the other hand the Madelung description is always appropriate both in and out of equilibrium.

Lorentz Symmetry
In the previous sections we discussed the entropy production $\Delta S_q$ of trainable variables $q$, but the dynamics of non-trainable variables was described only at the level of its free energy. In this section we are interested instead in the entropy production $\Delta S_x$ of non-trainable variables $x$ which we approximate as a sum of entropy productions of individual neurons, It is assumed that the state of neurons changes quasi-periodically [12], i.e., For concreteness, we assume that $d = 3$ which corresponds to the three spatial dimensions. Then, the entropy production can be modeled as a function of "displacements", and computational time $t$, i.e.
In general, there are two contributions to the entropy production: positive due to the second law of thermodynamics and negative due to the second law of learning [27]. The main idea is to model the positive entropy production as some non-negative function $\sigma_{i,+}(t) \geq 0$ of computational time $t$ and the negative entropy production (or entropy destruction) as some non-positive function $\sigma_{i,-}(\dot{x}_i^a(t)) \leq 0$ of displacements $\dot{x}_i^a(t)$. Then the total entropy production is given by where we defined a monotonic function In addition, the entropy destruction $\sigma_{i,-}(\dot{x}_i^a(t)) \leq 0$ must vanish if there are no displacements $\dot{x}_i^a = (0, 0, 0)$ which implies that there are no zeroth and first order terms in a perturbative expansion around origin, i.e.
Here $g_{i,ab}(t)$ is some positive definite matrix and the displacements $\dot{x}_i^a(t)$ are assumed to be small so that the third order terms can be neglected. By substituting (36) in (34) we obtain and if we define temporal components of the matrix, $g_{i,00} = -1$ and $g_{i,0a} = g_{i,a0} = 0$, then (37) takes a more covariant form for metric signature $(-, +, +, +)$. (For brevity of notation, summation over repeated raised and lowered indices is implied everywhere unless explicitly stated otherwise.) Note that for very large displacements the third order terms may become important and then the approximation in (36) would break down and the Lorentz symmetry in (38) would be broken. It is convenient to think of $\dot{x}_i^\mu$ as a four-vector in the tangent space at the "position" of the $i$-th neuron. Indeed, if, macroscopically, one can only observe the entropy (or entropy production), then we have Lorentz invariance in the sense that different representations of the four-vectors $\dot{x}_i^\mu$, that are connected to each other through Lorentz transformations, are indistinguishable. In a local equilibrium, the stochastic entropy production $\dot{x}_i^0(t)^2$ is balanced by the entropy destruction due to learning $g_{i,ab}\, \dot{x}_i^a(t)\, \dot{x}_i^b(t)$ and the entropy remains constant. Therefore, the null displacement vectors, $g_{i,\mu\nu}\, \dot{x}_i^\mu(t)\, \dot{x}_i^\nu(t) = 0$, describe neurons in equilibrium, $\Delta S_{x,i} = 0$. Moreover, the time-like displacement vectors, $g_{i,\mu\nu}\, \dot{x}_i^\mu(t)\, \dot{x}_i^\nu(t) < 0$, describe neurons for which stochastic dynamics dominates, $\Delta S_{x,i} > 0$, and the space-like displacement vectors, $g_{i,\mu\nu}\, \dot{x}_i^\mu(t)\, \dot{x}_i^\nu(t) > 0$, (if such displacement vectors can be stable) describe neurons for which learning dynamics dominates, $\Delta S_{x,i} < 0$.
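The classification of displacement vectors can be written out explicitly. The following sketch evaluates the quadratic form g_{mu nu} xdot^mu xdot^nu with g_00 = -1 and g_0a = g_a0 = 0, and labels a displacement as null, time-like or space-like according to its sign. The function names, the tolerance and the example vectors are hypothetical. The spatial block is taken to be the identity purely for illustration.

```python
def interval(xdot, g_spatial):
    """g_{mu nu} xdot^mu xdot^nu for g_00 = -1, g_0a = g_a0 = 0."""
    temporal, spatial = xdot[0], xdot[1:]
    quad = sum(g_spatial[a][b] * spatial[a] * spatial[b]
               for a in range(3) for b in range(3))
    return -temporal * temporal + quad

def classify(xdot, g_spatial, tol=1e-12):
    """Label a displacement four-vector by the sign of its interval."""
    s = interval(xdot, g_spatial)
    if abs(s) <= tol:
        return "null"        # production and destruction of entropy balance
    if s < 0:
        return "time-like"   # stochastic entropy production dominates
    return "space-like"      # entropy destruction due to learning dominates

# Hypothetical positive definite spatial block g_{i,ab}
g_spatial = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

labels = [classify(v, g_spatial) for v in ([1.0, 1.0, 0.0, 0.0],
                                           [1.0, 0.5, 0.0, 0.0],
                                           [1.0, 2.0, 0.0, 0.0])]
print(labels)
```

A non-diagonal positive definite g_spatial can be substituted without changing the logic, since only the sign of the quadratic form matters.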

Emergent Space-Time
The local space-time coordinates of individual neurons, $x_i^\mu$, can be transformed using shifts, rotations and boosts, i.e.
where $\Lambda^{\mu'}_{i,\mu}$ is a Lorentz matrix. If the matrix $g_{i,\mu\nu}$ is transformed using inverse Lorentz matrix $\Lambda$ then the entropy production does not change, (Note that we adopted a standard notation of primed-unprimed indices often used for coordinate transformations [32]). The main idea is to exploit the freedom of transformations to make an appropriately weighted average of $g_{i,\mu\nu}$ matrices as close to the flat metric $\eta_{\mu\nu}$ as possible, where $g_i = \det(g_{i,ab})$ and summation in the exponent is taken over only spatial components, $a, b = 1, 2, 3$. For simplicity, we assume that all of the local space-times are transformed into "synchronous gauge" with global time coordinate and $g_{00}(x) = -1$, $g_{a0}(x) = 0$, $g_{0a}(x) = 0$ (45).
Note that from now on the coordinate time is denoted by $t = x^0$ which need not be the same as computational time.
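The invariance of the quadratic form under a combined transformation of the displacement vector and the matrix can be verified numerically. The sketch below uses a boost along the first spatial axis as the Lorentz matrix, transforms the vector with it and the matrix with its inverse, and compares the quadratic form before and after. The boost velocity, the example matrix g and the example vector are hypothetical, and units with c = 1 are assumed.

```python
import math

def boost_x(v):
    """Lorentz boost along x^1 with velocity v (units with c = 1)."""
    k = 1.0 / math.sqrt(1.0 - v * v)
    return [[k, -k * v, 0.0, 0.0],
            [-k * v, k, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def apply(L, x):
    """Transform a four-vector: x'^m = L^m_n x^n."""
    return [sum(L[m][n] * x[n] for n in range(4)) for m in range(4)]

def transform_metric(L_inv, g):
    """Transform a rank-2 matrix with the inverse Lorentz matrix:
    g'_{mn} = (L^-1)^a_m (L^-1)^b_n g_{ab}."""
    return [[sum(L_inv[a][m] * L_inv[b][n] * g[a][b]
                 for a in range(4) for b in range(4))
             for n in range(4)] for m in range(4)]

def quadratic_form(g, x):
    return sum(g[m][n] * x[m] * x[n] for m in range(4) for n in range(4))

eta = [[-1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]

# Hypothetical local matrix with g_00 = -1 and a positive definite spatial block
g = [[-1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.3, 0.0],
     [0.0, 0.3, 1.5, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
xdot = [1.0, 0.4, -0.2, 0.1]

v = 0.6
L, L_inv = boost_x(v), boost_x(-v)   # the inverse boost reverses the velocity

before = quadratic_form(g, xdot)
after = quadratic_form(transform_metric(L_inv, g), apply(L, xdot))
eta_prime = transform_metric(L_inv, eta)
print(before, after)
```

The flat metric is mapped to itself by the same operation, while a generic local matrix g changes its components but not the value of the quadratic form.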
It is convenient to introduce the curly brackets notation, and then the (weighted average) metric tensor (43) can be expressed as and (weighted average) inverse metric tensor is defined as It is not immediately clear what is the relation between $g_{\mu\nu}(x)$ and $g^{\mu\nu}(x)$, but if the emergent space-time is nearly flat (43), then we can expand both (47) and (48) around flat metric to obtain and verify that the product of the (weighted average) metric tensor and of the (weighted average) inverse metric tensor is indeed identity, In general, the curly brackets (46) can be used for mapping discrete indices $i$ to continuous spatial coordinates $(x^1, x^2, x^3)$. For example, the total number of neurons can be expressed as which suggests that should be interpreted as the number density of neurons in the emergent space. Moreover, using the perturbative expansions (49) and (50) we can check that the determinant of the metric tensor $g_{ab}$ is the same as the weighted sum of determinants of $g_{i,ab}$, i.e., $-\det g_{\mu\nu}(x) = \det(g_{ab})$
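The following generic first-order statements may help in reading the nearly flat expansion. They are standard linear-algebra facts for a matrix close to $\eta_{\mu\nu}$ and its inverse, written with a hypothetical small perturbation $h_{\mu\nu}$. They are not the paper's Equations (49) and (50), which are expressed through the curly-bracket averages.

```latex
% Generic nearly flat expansion; h_{\mu\nu} is a hypothetical small perturbation
\begin{align}
g_{\mu\nu} &= \eta_{\mu\nu} + h_{\mu\nu}, \qquad |h_{\mu\nu}| \ll 1, \\
g^{\mu\nu} &= \eta^{\mu\nu} - \eta^{\mu\alpha} \eta^{\nu\beta} h_{\alpha\beta} + O(h^2), \\
g^{\mu\alpha} g_{\alpha\nu} &= \delta^{\mu}_{\nu} + O(h^2), \\
\det g_{\mu\nu} &= g_{00}\, \det(g_{ab}) = -\det(g_{ab})
  \quad \text{for } g_{00} = -1,\; g_{0a} = g_{a0} = 0 .
\end{align}
```

The last line uses only the block-diagonal structure of the synchronous gauge and is consistent with the relation $-\det g_{\mu\nu}(x) = \det(g_{ab})$ quoted in the text.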

Geodesic Equation
The proper time of a given neuron can be identified with the square root of the entropy production (65), i.e., If we are interested in a more macroscopic and localized distribution of neurons, then their average entropy production can be approximated using the metric tensor (47), In a continuum limit (56) becomes, which is usually expressed as a square of infinitesimal line element, By integrating the proper time from initial position According to the principle of stationary entropy production, it is expected that the neuron would "travel" along a path (from initial x we obtain the geodesic equation or, equivalently, in terms of proper time where the Christoffel symbol is defined as (See ref. [32] for a detailed derivation of the geodesic equation.) This result suggests that in the limit of minimal interactions, described by the metric tensor $g_{\mu\nu}(x_i)$, the localized states of neurons are expected to move along geodesics in the emergent space-time.
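For reference, the standard textbook forms of the line element, the geodesic equation in terms of proper time, and the Christoffel symbols are collected below for metric signature $(-, +, +, +)$. These are the generic expressions found in general relativity textbooks. The index names are a choice made here and need not match the paper's own numbered equations.

```latex
% Standard geodesic equation and Christoffel symbols, signature (-,+,+,+)
\begin{align}
d\tau^2 &= -g_{\mu\nu}(x)\, dx^\mu dx^\nu, \\
0 &= \frac{d^2 x^\mu}{d\tau^2}
   + \Gamma^{\mu}_{\alpha\beta}\, \frac{dx^\alpha}{d\tau} \frac{dx^\beta}{d\tau}, \\
\Gamma^{\mu}_{\alpha\beta} &= \frac{1}{2}\, g^{\mu\nu}
   \left( \partial_\alpha g_{\nu\beta} + \partial_\beta g_{\nu\alpha}
        - \partial_\nu g_{\alpha\beta} \right).
\end{align}
```

The geodesic equation follows from requiring the integrated proper time $\int d\tau$ between two fixed end points to be stationary, which is the role played in the text by the principle of stationary entropy production.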

Einstein Equations
In this section, we are interested in the total entropy production of the non-trainable variables in the entire neural network during global time interval $T$, The entropy production of individual neurons (38) in the synchronous gauge is and after integrating by parts we obtain where we have neglected the boundary term. We can also drop the constant term $NT$ (which is irrelevant for variational problems) and rewrite the entropy production using Gaussian integration formula, where in the last line the definition of curly brackets (46) was used. Using the geodesic Equation (61) the total entropy production (67) can be recast into the following form, To proceed further, we make a crucial assumption that on average, In other words, we assume that displacement of the $i$-th neuron depends equally on its own covariance matrix $g_i^{\alpha\beta}$ and on the weighted average covariance matrix $g^{\alpha\beta}(x_i)$. Then by plugging (69) into (67) and using (48) and (53) we get, It is now easy to show that (70) is equivalent to the Einstein-Hilbert action up to a boundary term, i.e., where the Ricci tensor is By setting variations of the entropy production $\Delta S_x[g]$ with respect to the inverse metric tensor $g^{\mu\nu}$ to zero we obtain the vacuum Einstein equations, (See ref. [32] for a detailed derivation of the Einstein equations from the Einstein-Hilbert action.) So far the total number of neurons $N$ was fixed, but, as was argued in ref. [13] and in Section 4, for the quantumness to emerge the number of neurons $N$ must vary. Such variations can be introduced into the variational problem by defining a functional, where we used (52), (53) and (71). By varying $S[g, \Lambda]$ with respect to the inverse metric $g^{\mu\nu}$, we obtain Einstein equations with cosmological constant, i.e.
Evidently, the Lagrange multiplier $2\Lambda$ constrains the average number of neurons and plays the role of the cosmological constant $\Lambda$ in the gravitational description of non-trainable variables. We recall that in the quantum description of trainable variables (see Section 4) the Lagrange multiplier $\mu = \pm 2\pi\hbar$ also constrains the average number of neurons, but instead it plays the role of the Planck's constant $\hbar$. Thus, the roles of the Lagrange multipliers $2\Lambda$ and $\mu = \pm 2\pi\hbar$ in the gravitational description of non-trainable variables and in the quantum description of trainable variables are very different.
In statistical description, the parameter $\Lambda$ would play the role of a "chemical potential" which would be responsible for both "neurogenesis" and "neurodegeneration". If the parameter can vary in time, then for a system with a small number of neurons (e.g., early Universe) $\Lambda$ would be larger, but for a system with a large number of neurons (e.g., late Universe) $\Lambda$ would be smaller. This can potentially explain both the early-time accelerated expansion (i.e., cosmic inflation) and the late-time accelerated expansion (i.e., the dark energy), but for the former case a more thorough modeling of the spatial variations of the number density of neurons is required. In addition, the dynamics of trainable variables $q(t)$ must be described by either Madelung or Schrodinger equations (see Sections 3 and 4) and thus additional equations of motion must be satisfied and additional constraints must be imposed. However, from the point of view of the metric dynamics, there should exist an appropriately defined energy-momentum tensor $T_{\mu\nu}$ that would be acting as a source in the Einstein equations, In addition, it is important to model possible deviations from the assumption (69) in the context of astrophysical observations of, for example, dark matter. Of course, all such generalizations require a more careful modeling of the dynamics of the trainable variables which is beyond the scope of this paper.
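The role of $2\Lambda$ as a Lagrange multiplier can be compared with the textbook Einstein-Hilbert action with a cosmological constant, reproduced below. The overall normalization $1/(2\kappa)$ is a conventional choice assumed here for illustration. It is not taken from the paper, whose functional is built from the entropy production and a constraint on the number of neurons.

```latex
% Textbook Einstein-Hilbert action with cosmological constant
% (normalization 1/(2\kappa) is a conventional assumption)
\begin{align}
S[g] &= \frac{1}{2\kappa} \int d^4x\, \sqrt{-g}\, \left( R - 2\Lambda \right), \\
\frac{\delta S}{\delta g^{\mu\nu}} = 0
\;&\Rightarrow\;
R_{\mu\nu} - \frac{1}{2} R\, g_{\mu\nu} + \Lambda\, g_{\mu\nu} = 0 .
\end{align}
```

In this form the term $-2\Lambda \sqrt{-g}$ multiplies the invariant volume element, which parallels the way a multiplier $2\Lambda$ attached to a volume-like constraint enters the variational problem described in the text.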

Discussion
All successful physical models are built on top of mathematical frameworks or theories. These theories are never proven in a rigorous mathematical sense, but instead they are validated through either repeated experiments or observations of the Universe around us. In the twentieth century two such theories were first proposed (quantum mechanics and general relativity) and then successfully applied to modeling physical phenomena on a wide range of scales from $10^{-19}$ m (i.e., high-energy experiments) to $10^{26}$ m (i.e., cosmological observations). However, all of the attempts to treat one of these theories as fundamental, and the other one as emergent have so far failed (i.e., the problem of quantum gravity). In addition, both theories seem to fall apart with the introduction of macroscopic observers like ourselves. In some sense, the situation with observers was even worse than with physical phenomena, since we did not even have a mathematical framework for modeling observers. Indeed, there is not a single self-consistent and paradox-free definition of macroscopic observers that could describe what is actually happening with the quantum state during measurement (i.e., the measurement problem) or how to assign probabilities to cosmological observations (i.e., the measure problem). Fortunately, the situation is changing and now we do have a mathematical framework of neural networks which can describe many (if not all) biological phenomena [33]. The main question, however, remains: can the theory of neural networks be the fundamental theory [12] from which (not only macroscopic observers [34] or some complex phenomena [35], but) all biological and physical phenomena emerge? If so, then the theories of quantum mechanics and general relativity must not be fundamental, but emergent.
The idea that quantum mechanics can emerge from anything classical, including neural networks, is very counterintuitive. And the main problem is not that in quantum mechanics we are dealing with probabilities and in classical physics everything is deterministic. Even in quantum mechanics the wave-function $\Psi(q)$ evolves deterministically and it is only because of the measurements that the probabilities $p(q) = |\Psi(q)|^2$ arise. In fact, this is not very different from statistical mechanics, but what is different is that in quantum mechanics not only probabilities, or square-root of probabilities $|\Psi(q)|$, but also the complex phase of the wave-function $\mathrm{Im}(\log(\Psi(q)))$, evolves according to the Schrodinger equation. To show that this might be possible in a given dynamical system requires two non-trivial steps. The first step is to provide a microscopic interpretation of the complex phase which, in the case of neural networks, is the free energy of non-trainable variables $\mathrm{Im}(\log(\Psi(q))) = F(q)/\hbar$. Note that the microscopic interpretation of the phase was also given in ref. [9] for constrained systems and in refs. [12,27] for equilibrium systems, but as was shown in Section 3 similar results also hold for non-equilibrium systems. The second step is to show that the complex phase, or the free energy $\tilde{F}(q)$ in the case of neural networks, is multivalued. The multivaluedness condition is essential for the fully quantum behavior to emerge [31] and in the case of neural networks it is satisfied for a grand-canonical ensemble of neurons [13]. In Section 4 we extended this result to non-equilibrium systems that are capable of adjusting their own parameters (e.g., number of neurons, step size, mini-batch size). More precisely, we have shown that the quantum description of neural networks is appropriate for modeling the non-equilibrium dynamics of trainable variables with non-trainable (or hidden) variables modeled through their free energy and the number of neurons constrained by a Lagrange multiplier which plays the role of the Planck constant.
The problem of emergent gravity [21][22][23][24] is even more complicated, just because it is impossible to study the emergence of general relativity until the space, time and space-time symmetries have already emerged. In the context of neural networks, the problem was first studied in ref. [27] and more specifically in ref. [12], but in both cases the description was too phenomenological or architecture-dependent for anything substantial to be said about the nature of dark energy, dark matter or cosmic inflation. In this paper we improved our understanding of the emergent gravity on several fronts. First of all, we showed that the Lorentz symmetries emerge from the equilibrium dynamics for null vectors, from the stochastic entropy production for time-like vectors and from the entropy destruction due to learning for space-like vectors. This is in agreement with a common view that "time" has a thermodynamic origin, but it also suggests that "space" must emerge from learning. Secondly, we used the freedom of Lorentz transformations to define the emergent space-time and the metric tensor which is, by construction, as close as possible to being flat. In fact, it was essential for the space-time to be nearly flat and we expect the relativistic description to break down in regions of high curvature. Thirdly, we considered localized states of neurons, with minimal interactions described by the metric tensor, to show that they must move along geodesics in the emergent space-time. And finally, we showed that the general relativistic description is appropriate for modeling the dynamics of non-trainable variables with trainable variables modeled through their energy-momentum tensor and with the number of neurons constrained by a Lagrange multiplier which plays the role of the cosmological constant.
In conclusion, we would like to emphasize that the quantum and gravitational descriptions presented in this paper are dual in the sense that they provide alternative macroscopic descriptions of the same learning system, defined microscopically as a neural network. This duality does not have an obvious connection to the holographic duality [15][16][17] although such a possibility was discussed in ref. [12]. On the other hand, a fully quantum description can only emerge from a neural network if the number of neurons is not fixed, in which case a constraint on the number of neurons must be imposed in both sectors, i.e., gravitational and quantum. The Lagrange multiplier which imposes the constraint in the quantum description is the Planck constant (see Section 4), but the Lagrange multiplier which imposes the constraint in the gravitational description is the cosmological constant (see Section 8). This implies that a quantum system can only be dual to a gravitational system with cosmological constant as in AdS/CFT [15][16][17], but the sign of the cosmological constant can be arbitrary.