The World as a Neural Network

We discuss the possibility that the entire universe, on its most fundamental level, is a neural network. We identify two different types of dynamical degrees of freedom: “trainable” variables (e.g., bias vector or weight matrix) and “hidden” variables (e.g., state vector of neurons). We first consider stochastic evolution of the trainable variables to argue that near equilibrium their dynamics is well approximated by Madelung equations (with free energy representing the phase) and further away from the equilibrium by Hamilton–Jacobi equations (with free energy representing Hamilton’s principal function). This shows that the trainable variables can indeed exhibit classical and quantum behaviors with the state vector of neurons representing the hidden variables. We then study stochastic evolution of the hidden variables by considering D non-interacting subsystems with average state vectors, x¯1, …, x¯D, and an overall average state vector x¯0. In the limit when the weight matrix is a permutation matrix, the dynamics of x¯μ can be described in terms of relativistic strings in an emergent D+1 dimensional Minkowski space-time. If the subsystems are minimally interacting, with interactions that are described by a metric tensor, then the emergent space-time becomes curved. We argue that the entropy production in such a system is a local function of the metric tensor which should be determined by the symmetries of the Onsager tensor. It turns out that a very simple and highly symmetric Onsager tensor leads to the entropy production described by the Einstein–Hilbert term. This shows that the learning dynamics of a neural network can indeed exhibit approximate behaviors that are described by both quantum mechanics and general relativity. We also discuss the possibility that the two descriptions are holographic duals of each other.


Introduction
Quantum mechanics is a remarkably successful paradigm for modeling physical phenomena on a wide range of scales, from $10^{-19}$ m (i.e., high-energy experiments) to $10^{+26}$ m (i.e., cosmological observations). The paradigm is so successful that it is widely believed that, on the most fundamental level, the entire universe is governed by the rules of quantum mechanics and even gravity should somehow emerge from it. This is known as the problem of quantum gravity, which so far has not been solved, but some progress has been made in the context of AdS/CFT [1][2][3], loop quantum gravity [4][5][6], and emergent gravity [7][8][9]. Although extremely important, the problem of quantum gravity is not the only problem with quantum mechanics. The quantum framework also starts to fall apart with the introduction of observers. Everything seems to work very well when observers are kept outside of a quantum system, but it is far less clear how to describe macroscopic observers in a quantum system, such as the universe itself. The realization of the problem triggered an ongoing debate on the interpretations of quantum mechanics, which remains unsettled to this day. On one side of the debate, there are proponents of the many-worlds interpretation claiming that everything in the universe (including observers) must be governed by the Schrödinger equation [10], but then it is not clear how classical probabilities would emerge. On the other side of the debate, there are proponents of hidden-variable theories [11], but there it is also unclear what the role of the complex wave-function is in a purely statistical system. It is important to emphasize that a working definition of observers is necessary not only for settling some philosophical debates, but also for understanding the results of real physical experiments and cosmological observations.
In particular, a self-consistent and paradox-free definition of observers would allow us to understand the significance of Bell's inequalities [12] and to make probabilistic predictions in cosmology [13]. To resolve the apparent inconsistency (or incompleteness) in our description of the physical world, we shall entertain the idea of having a more fundamental theory than quantum mechanics. A working hypothesis is that, on the most fundamental level, the dynamics of the entire universe is described by a microscopic neural network that undergoes learning evolution. If correct, then not only macroscopic observers, but, more importantly, quantum mechanics and general relativity should correctly describe the dynamics of the microscopic neural network in the appropriate limits. (Note that the idea of using neural networks to describe gravity is not new and was recently explored in the contexts of quantum neural networks [14], AdS/CFT [15] and emergent gravity [16].) In this paper, we shall first demonstrate that near equilibrium the learning evolution of a neural network can indeed be modeled (or approximated) with the Madelung equations (see Section 5), where the phase of the complex wave-function has a precise physical interpretation as the free energy of a statistical ensemble of hidden variables. The hidden variables describe the (classical) state of the individual neurons whose statistical ensemble is given by a partition function and the corresponding free energy. This free energy is a function of the trainable variables (such as bias vector and weight matrix), whose stochastic and learning dynamics we shall study (see Section 4). Note that, while the stochastic dynamics generically leads to the production of entropy (i.e., second law of thermodynamics), the learning dynamics generically leads to the destruction of entropy (i.e., second law of learning).
As a result, in equilibrium the time-averaged entropy of the system remains constant and the corresponding dynamics can be modeled using quantum mechanics. It is important to note that the entropy (and entropy production) that we discuss here is the entropy of either hidden or trainable variables, which need not vanish even for pure states. Of course, one can also discuss mixed states and then the corresponding von Neumann entropy gives an additional contribution to the total entropy.
The situation changes dramatically whenever some of the degrees of freedom are not thermalized. While it should, in principle, be possible to model the thermalized degrees of freedom using quantum theory, the non-thermalized degrees of freedom are not likely to exactly follow the rules of quantum mechanics. We shall discuss two non-equilibrium limits: one that can nevertheless be described using classical physics (e.g., Hamiltonian mechanics) and another that can be described using gravitational physics (e.g., general relativity). The classical limit is relevant when the non-equilibrium evolution of the trainable variables is dominated by the entropy destruction due to learning, while the stochastic entropy production is negligible. The dynamics of such a system is well approximated by the Hamilton-Jacobi equations with free energy playing the role of Hamilton's principal function (see Section 6). The gravitational limit is relevant when even the hidden variables (i.e., state vectors of neurons) have not yet thermalized and the stochastic entropy production governs the non-equilibrium evolution of the system (see Section 9). In the long run, all of the degrees of freedom must thermalize and then quantum mechanics should provide a correct description of the learning system.
It is well known that, during learning, the neural network is attracted towards a network with a low complexity, a phenomenon also known as dimensional reduction or what we call the second law of learning [16]. An example of a low complexity neural network is one that is described by a permutation weight matrix, or one that is made out of one-dimensional chains of neurons. (Note that a similar phenomenon was recently observed in the context of the information graph flow [17].) If the set of state vectors can also be divided into non-interacting subsets (or subsystems) with average state vectors, $\bar{x}^{1}, \ldots, \bar{x}^{D}$, and an overall average state vector $\bar{x}^{0}$, then the dynamics of $\bar{x}^{\mu}$ can be described with relativistic strings in an emergent $D + 1$ dimensional space-time (see Section 8). In general, the subsystems would interact and then the emergent space-time would be described by a gravitational theory, such as general relativity (see Section 9). Note that, in either case, the main challenge is to figure out exactly which degrees of freedom have already thermalized (and, thus, can be modeled with quantum mechanics) and which degrees of freedom are still in the process of thermalization and should be modeled with other methods, such as Hamiltonian mechanics or general relativity. In addition, we shall discuss yet another method, which is motivated by the holographic principle and is particularly useful when the bulk neurons are still in the process of equilibration, but the boundary neurons have already thermalized (see Section 10).
The paper is organized as follows. In Section 2, we review the theory of neural networks and, in Section 3, we discuss a thermodynamic approach to learning. In Section 4, we derive the action that governs the dynamics of the trainable variables by applying the principle of stationary entropy production. The action is used to study the dynamics near equilibrium in Section 5 (which corresponds to the quantum limit) and further away from equilibrium in Section 6 (which corresponds to the classical limit). In Section 7, we study non-equilibrium dynamics of the hidden variables and, in Section 8, we argue that, in certain limits, the dynamics can be described in terms of relativistic strings in the emergent space-time. In Section 9, we apply the principle of stationary entropy production to derive the action that describes equilibration of the emergent space-time (which corresponds to the gravitational limit) and, in Section 10, we discuss when the gravitational theory can have a holographic dual description as a quantum theory. In Section 11, we summarize and discuss the main results of the paper.

Neural Networks
We start with a brief review of the theory of neural networks by following the construction that was introduced in Ref. [16]. The neural network shall be defined as a neural septuple $(x, \hat{P}_{in}, \hat{P}_{out}, \hat{w}, b, f, H)$, where $x \in \mathbb{R}^{N}$ is the state vector of neurons, $\hat{P}_{in}$ and $\hat{P}_{out}$ are the projection operators to subspaces spanned by, respectively, $N_{in}$ input and $N_{out}$ output neurons, $\hat{w} \in \mathbb{R}^{N \times N}$ is a weight matrix, $b \in \mathbb{R}^{N}$ is a bias vector, $f: \mathbb{R} \to \mathbb{R}$ is an activation function and $H: \mathbb{R}^{N} \times \mathbb{R}^{N} \times \mathbb{R}^{N \times N} \to \mathbb{R}$ is a loss function. This definition is somewhat different from the one usually used in the literature on machine learning, but we found that it is a lot more useful for analyzing physical theories in the context of a microscopic neural network that we are interested in here. We shall not distinguish between different layers and, so, all $N$ neurons are connected into a single neural network with connections that are described by a single $N \times N$ weight matrix, $\hat{w}$. The matrix can be viewed as an adjacency matrix of a weighted directed graph with neurons representing the nodes and elements of the weight matrix representing directed edges. However, we will distinguish between two different types of neurons: the boundary neurons, $N_{\partial} = N_{in} + N_{out}$, and the bulk neurons, $N_{\not\partial} = N - N_{\partial}$. Similarly, the boundary and bulk projection operators are defined, respectively, as $\hat{P}_{\partial} = \hat{P}_{in} + \hat{P}_{out}$ and $\hat{P}_{\not\partial} = \hat{I} - \hat{P}_{\partial}$.
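The split into boundary and bulk neurons can be illustrated with a small numerical sketch. This is only an illustration under stated assumptions: the network size, the choice of which neurons are inputs and outputs, and the helper names `projector` and `trace` are hypothetical and are not part of the construction of Ref. [16].

```python
def projector(indices, n):
    """Diagonal n x n projection matrix onto the listed neuron indices."""
    chosen = set(indices)
    return [[1.0 if i == j and i in chosen else 0.0 for j in range(n)]
            for i in range(n)]


def trace(m):
    """Trace of a square matrix given as a list of rows."""
    return sum(m[i][i] for i in range(len(m)))


N = 6                    # total number of neurons (assumed)
input_neurons = [0, 1]   # hypothetical labelling of input neurons
output_neurons = [5]     # hypothetical labelling of output neurons

P_in = projector(input_neurons, N)
P_out = projector(output_neurons, N)

# Boundary projector: P_boundary = P_in + P_out
P_boundary = [[P_in[i][j] + P_out[i][j] for j in range(N)] for i in range(N)]
# Bulk projector: P_bulk = I - P_boundary
P_bulk = [[(1.0 if i == j else 0.0) - P_boundary[i][j] for j in range(N)]
          for i in range(N)]

N_boundary = int(trace(P_boundary))  # equals N_in + N_out
N_bulk = int(trace(P_bulk))          # equals N - N_boundary
```

The traces of the two projectors count the boundary and bulk neurons, and the two projectors are complementary by construction.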
The state vector of neurons, $x \in \mathbb{R}^{N}$, or just state vector, evolves in discrete time-steps according to the equation

$$x(t+1) = f\left(\hat{w}\, x(t) + b\right), \tag{1}$$

which can also be written in terms of components,

$$x_i(t+1) = f\left(w_{ij}\, x_j(t) + b_i\right). \tag{2}$$

Note that summations over repeated indices are implied everywhere in the paper unless stated otherwise (e.g., $w_{ij} x_j \equiv \sum_j w_{ij} x_j$). A crucial simplification of the dynamical system (1) was to assume that the activation map $f: \mathbb{R}^{N} \to \mathbb{R}^{N}$ acts separately on each component (2) with some activation function $f(x)$. The logistic function $f(x) = (1 + \exp(-x))^{-1}$ and the rectified linear unit $f(x) = \max(0, x)$ are some important examples of the activation function, but we shall use the hyperbolic tangent $f(x) = \tanh(x)$, which is also widely used in machine learning.
The main reason is that the hyperbolic tangent is a smooth odd function with a bounded range, which greatly simplifies the analytical calculations that we shall carry out in the paper.
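The discrete-time evolution of the state vector with a hyperbolic tangent activation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the two-neuron weights, the bias, the initial state and the helper name `step` are assumptions chosen only to make the example concrete.

```python
import math


def step(x, w, b, f=math.tanh):
    """One time-step: x_i(t+1) = f(sum_j w_ij x_j(t) + b_i)."""
    n = len(x)
    return [f(sum(w[i][j] * x[j] for j in range(n)) + b[i]) for i in range(n)]


# Hypothetical two-neuron network (values chosen for illustration only)
w = [[0.0, 0.5],
     [-0.5, 0.0]]
b = [0.1, -0.1]
x = [0.3, -0.2]

trajectory = [x]
for _ in range(5):
    trajectory.append(step(trajectory[-1], w, b))
```

Because tanh is bounded, every state after the first step stays inside the open interval (-1, 1), and because it is odd, flipping the sign of the state flips the sign of the next state when the bias vanishes.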
The main problem in machine learning, or the main learning objective, is to find a bias vector, $b$, and a weight matrix, $\hat{w}$, which minimize some suitably defined loss function $H(x, b, \hat{w})$. In what follows, we shall consider two loss functions: the "bulk" loss and the "boundary" loss. The bulk loss function is defined as a local sum over all neurons,

$$H(x, b, \hat{w}) = \frac{1}{2}\left(x - f(\hat{w}x + b)\right)^{T}\left(x - f(\hat{w}x + b)\right) + V(x). \tag{3}$$

The first term represents the sum over squares of local errors or, equivalently, differences between the state of a neuron before, $x_i$, and after, $f(w_{ij} x_j + b_i)$, a single execution of the activation map. The second term represents a local objective, such as a binary classification of the signal $x_i$. For example, if $V(x_i) = -\frac{m}{2} x_i^2$, then the values of $x_i$ closer to the lower and upper bounds are rewarded and values in between are penalized. Although the bulk loss is much easier to analyze analytically, in practice it is often more useful to define the boundary loss function by summing over only boundary neurons,

$$H_{\partial}(x, b, \hat{w}) = \frac{1}{2}\left(x - f(\hat{w}x + b)\right)^{T}\hat{P}_{\partial}\left(x - f(\hat{w}x + b)\right) + V_{\partial}(x). \tag{4}$$

In fact, the boundary loss is usually used in supervised learning, but, as was argued in [16], the bulk loss is more suitable for unsupervised learning tasks. Instead of following the dynamics of the individual states, which might be challenging, one can use the principle of maximum entropy [18,19] to derive a canonical ensemble of states [16]. The corresponding canonical partition function is

$$Z(\beta, b, \hat{w}) = \int d^{N}x \; e^{-\beta H(x, b, \hat{w})}, \tag{5}$$

and the free energy is

$$F(\beta, b, \hat{w}) = -\frac{1}{\beta}\log Z(\beta, b, \hat{w}). \tag{6}$$

At a constant "temperature", $T = \beta^{-1} = \text{const}$, the ensemble can evolve with time either due to internal (or what we shall call hidden) dynamics of the state vector, $x(t)$, or due to the external (or what we shall call training) dynamics of the bias vector, $b(t)$, and weight matrix, $\hat{w}(t)$. The partition function for the bulk loss function (3) with a mass-term potential, $V(x_i) = -\frac{m}{2} x_i^2$, and a hyperbolic tangent activation function, $f(x) = \tanh(x)$, was calculated in [16] using a Gaussian approximation.
The result is where $\hat{G}$ and $\hat{f}'$ is a diagonal matrix of first derivatives of the activation function,
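The relation between a quadratic loss, the canonical partition function and the free energy can be checked numerically in the simplest case of a single neuron. The sketch below is an illustration under explicit assumptions: the bulk loss is taken with a factor of 1/2 in front of the squared error and with $V = 0$, the activation is linear, and the function names, the inverse temperature and the weight are hypothetical. It is not the Gaussian approximation of Ref. [16], only a toy case in which the Gaussian integral is exact.

```python
import math


def bulk_loss(x, w, b, f=math.tanh, V=lambda xi: 0.0):
    """Assumed bulk loss: sum_i [ (x_i - f(w_ij x_j + b_i))^2 / 2 + V(x_i) ]."""
    n = len(x)
    total = 0.0
    for i in range(n):
        err = x[i] - f(sum(w[i][j] * x[j] for j in range(n)) + b[i])
        total += 0.5 * err * err + V(x[i])
    return total


def partition_function_1d(beta, w, b, f, lo=-20.0, hi=20.0, steps=8000):
    """Midpoint-rule estimate of Z = integral dx exp(-beta H) for one neuron."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += math.exp(-beta * bulk_loss([x], [[w]], [b], f))
    return total * h


def free_energy(beta, Z):
    """F = -(1/beta) log Z."""
    return -math.log(Z) / beta


beta, w = 2.0, 0.5                       # hypothetical values
linear = lambda y: y                     # linear activation (assumption)
Z_numeric = partition_function_1d(beta, w, 0.0, linear)
# For a linear activation H = (1 - w)^2 x^2 / 2, so the integral is Gaussian.
Z_gaussian = math.sqrt(2.0 * math.pi / (beta * (1.0 - w) ** 2))
F_numeric = free_energy(beta, Z_numeric)
```

In this one-dimensional case the quantity $(1 - w)^2$ plays the role of the operator $\hat{G}$, and the free energy depends on it only through its logarithm.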

Thermodynamics of Learning
Given the partition function, the average loss can be calculated by a simple differentiation,

$$U = \langle H \rangle = -\frac{\partial}{\partial \beta}\log Z. \tag{10}$$

If the neural network was trained for a long time, then the weight matrix and the bias vector are in a state that minimizes (at least locally) the average loss function and then its variations with respect to $\hat{w}$ and $b$ must vanish,

$$\frac{\partial U}{\partial w_{ij}} = 0, \qquad \frac{\partial U}{\partial b_i} = 0. \tag{11}$$

We shall call this state the state of learning equilibrium. An important property of the equilibrium, which follows from (11), is that the total free energy must decompose into a sum of two terms Likewise, the total entropy must also decompose into a sum of two terms,

$$S_x(\beta, b, \hat{w}) = S_0(\beta) + C(b, \hat{w}),$$

where the first term is the familiar thermodynamic entropy and the second term, $C(b, \hat{w})$, is related to the complexity of the neural network (see Ref. [16]). As the learning progresses, the average loss, $U(\beta)$, decreases, the temperature parameter, $\beta^{-1}$, decreases, and, thus, one might expect that the thermodynamic entropy, $S_0$, should also decrease. However, it is not the thermodynamic entropy, $S_0$, but the total entropy, $S_x$ (whose exponent describes the accessible volume of the configuration space for $x$), that should decrease with learning. We call it the second law of learning: Second Law of Learning: the total entropy of a learning system can never increase during learning and is constant in a learning equilibrium,

$$\frac{d}{dt} S_x \leq 0.$$

In the long run the system is expected to approach an equilibrium state with the smallest possible total entropy, $S_x$, which corresponds to the lowest possible sum of the thermodynamic entropy, $S_0(\beta)$, and of the complexity function $C(b, \hat{w})$.
For a system transitioning between equilibrium states at constant temperature, $T = 1/\beta$, variations of the free energy must vanish, $dF = 0$, and then Equation (12) takes the form of the first law, or what we call the first law of learning: First Law of Learning: the increment in the loss function is proportional to the increment in the thermodynamic entropy plus the increment in the complexity,

$$dU = T\, dS_0 + T\, dC.$$
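The thermodynamic relations used in this section can be illustrated on a toy canonical ensemble with a finite number of states. The sketch below is not specific to neural networks: the list of loss values and the inverse temperature are hypothetical, and the identity $S = \beta(U - F)$ is the standard canonical-ensemble relation, used here as an assumption about how the total entropy is related to $U$ and $F$.

```python
import math


def log_Z(beta, losses):
    """Logarithm of the canonical partition function of a finite ensemble."""
    return math.log(sum(math.exp(-beta * h) for h in losses))


def average_loss(beta, losses):
    """U = <H> computed directly from the Boltzmann weights."""
    weights = [math.exp(-beta * h) for h in losses]
    return sum(h * w for h, w in zip(losses, weights)) / sum(weights)


def free_energy(beta, losses):
    """F = -(1/beta) log Z."""
    return -log_Z(beta, losses) / beta


def entropy(beta, losses):
    """Standard canonical relation S = beta (U - F)."""
    return beta * (average_loss(beta, losses) - free_energy(beta, losses))


losses = [0.0, 0.5, 1.0, 2.0]   # hypothetical loss values of four states
beta = 1.5                      # hypothetical inverse temperature

U = average_loss(beta, losses)
h = 1e-5
U_from_derivative = -(log_Z(beta + h, losses) - log_Z(beta - h, losses)) / (2 * h)

# Gibbs entropy -sum p log p of the same ensemble, for comparison
Z = math.exp(log_Z(beta, losses))
probs = [math.exp(-beta * e) / Z for e in losses]
gibbs_entropy = -sum(p * math.log(p) for p in probs)
S = entropy(beta, losses)
```

The average loss obtained by differentiating $\log Z$ agrees with the direct ensemble average, and $\beta(U - F)$ reproduces the Gibbs entropy of the ensemble.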

Entropic Mechanics
So far, the neural networks were analyzed by considering statistical ensembles of the state vectors, x, but the bias vector, b, and weight matrix,ŵ, were treated deterministically. The next step is to promote b andŵ to stochastic variables in order to study their near-equilibrium dynamics. In the next section, we will show that the training dynamics of b andŵ can be approximated by Madelung equations, with x playing the role of the hidden variables. For this reason, we shall refer to the bias vectors and weight matrices as "trainable" variables and to the state vectors as "hidden" variables. This does not mean that the trainable variables are the quantized versions of the corresponding classical variables, but only that their stochastic evolution near equilibrium can often be described by quantum mechanics.
Consider a family of trainable variables, $b(q)$ and $\hat{w}(q)$, parametrized by dynamical parameters $q_k$, where $k \in \{1, \ldots, K\}$. Typically, the number of parameters $K$ is much smaller than $N + N^2$ (i.e., the number of parameters required to describe a generic vector $b$ and a generic matrix $\hat{w}$) and the art of designing a neural architecture is to come up with functions $b(q)$ and $\hat{w}(q)$ that are most efficient in finding solutions. To make the statement more quantitative, consider an ensemble of neural networks described by a probability distribution $p(t, q)$, which evolves with time according to a Fokker-Planck equation If we assume that the learning evolution (or the drift) is in the direction of the gradient of the free energy, then ∂p ∂t This may be a good guess on short time-scales when the free energy does not change much, but in general both $p(t, q)$ and $F(t, q)$ can depend on time explicitly and implicitly through the variable $q$. To describe such dynamics, we shall employ the principle of stationary entropy production (see Ref. [20]):
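A drift-diffusion evolution of this kind can be sketched numerically in one dimension. The sketch below is an illustration under explicit assumptions, not the equation of the text: it integrates $\partial_t p = \partial_q\left(D\,\partial_q p - \gamma\,(\partial_q F)\,p\right)$ on a periodic grid, where the diffusion coefficient `D`, the drift coefficient `gamma`, the free energy $F(q) = \cos q$ and the function name `fokker_planck_step` are all hypothetical choices.

```python
import math


def fokker_planck_step(p, dF, dq, dt, D, gamma):
    """One explicit, conservative step of dp/dt = d/dq (D dp/dq - gamma F' p).

    flux[i] is the probability current through the interface between cells
    i and i+1 (periodic boundary conditions).
    """
    n = len(p)
    flux = []
    for i in range(n):
        j = (i + 1) % n
        p_mid = 0.5 * (p[i] + p[j])
        drift = gamma * 0.5 * (dF[i] + dF[j])   # drift along the gradient of F
        flux.append(drift * p_mid - D * (p[j] - p[i]) / dq)
    # flux[i - 1] wraps around to the last interface when i == 0
    return [p[i] - dt / dq * (flux[i] - flux[i - 1]) for i in range(n)]


n = 64
dq = 2.0 * math.pi / n
q = [i * dq for i in range(n)]
dF = [-math.sin(qi) for qi in q]        # gradient of the assumed F(q) = cos(q)
p = [1.0 / (2.0 * math.pi)] * n         # start from a uniform distribution
D, gamma, dt = 0.1, 0.2, 0.01           # hypothetical coefficients

for _ in range(2000):
    p = fokker_planck_step(p, dF, dq, dt, D, gamma)

total_probability = sum(p) * dq
```

Because the update is written in flux form, the total probability is conserved up to round-off, and with the drift pointing along the gradient of $F$ the distribution accumulates near the maximum of the assumed free energy at $q = 0$.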

Principle of Stationary Entropy Production:
The path taken by a system is the one for which the entropy production is stationary.
The principle can be thought of as a generalization of both the maximum entropy principle [18,19] and the minimum entropy production principle [21,22], which is often used in non-equilibrium thermodynamics. In the context of neural networks, it is beneficial to have large entropy, as it implies a higher rate with which new solutions can be discovered. Subsequently, the optimal neural architecture should be the one for which the entropy destruction is minimized or, equivalently, the entropy production is maximized. This justifies the use of the principle in the context of optimal learning systems [16]. The Shannon entropy of the distribution $p(t, q)$ (not to be confused with $S_x(\beta, q)$) is given by

$$S_q(t) = -\int d^{K}q \; p(t, q)\log p(t, q),$$

and using (20), the entropy production is given by which can be simplified (after integrating by parts and ignoring the boundary terms, i.e., by assuming periodic or vanishing boundary conditions), This quantity is a functional of both $p(t, q)$ and $F(t, q)$ and, thus, in addition to modeling the dynamics of the probability distribution, we must also model the dynamics of the free energy.
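The Shannon entropy of a distribution over a single trainable variable can be evaluated on a grid and compared with a known closed form. The sketch below is illustrative only: the Gaussian shape of $p(q)$, its widths, the grid and the function names are assumptions, and the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$ is the standard differential entropy of a Gaussian.

```python
import math


def gaussian(q, sigma):
    """Normalized Gaussian density with zero mean and width sigma."""
    return math.exp(-0.5 * (q / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def shannon_entropy(p, dq):
    """Riemann-sum estimate of S = -integral p log p dq (skips p = 0 cells)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0) * dq


dq = 0.01
grid = [-15.0 + k * dq for k in range(3001)]   # covers [-15, 15]

sigma_narrow, sigma_wide = 1.5, 2.0            # hypothetical widths
S_narrow = shannon_entropy([gaussian(q, sigma_narrow) for q in grid], dq)
S_wide = shannon_entropy([gaussian(q, sigma_wide) for q in grid], dq)

S_narrow_exact = 0.5 * math.log(2.0 * math.pi * math.e * sigma_narrow ** 2)
```

A purely diffusive evolution broadens the distribution and therefore increases this entropy, which is the sense in which stochastic dynamics produces entropy, whereas a drift that sharpens the distribution lowers it.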
The total rate of change of the free energy is given by where the first term represents the change of the free energy due to dynamics of hidden variables, $x$, and the second term represents the change in the free energy due to dynamics of trainable variables, $b$ and $\hat{w}$. In what follows, it will be convenient to denote the time-averaged rate of change of free energy as d dt Subsequently, according to the principle of stationary entropy production, the dynamics of $p(t, q)$ and $F(t, q)$ must be such that the entropy production is extremized subject to a constraint The optimization problem can be solved by defining the following "action", where $\mu$ is a Lagrange multiplier, and then the "equations of motion" are obtained by setting variations of the action to zero,

Quantum Mechanics
In the previous section, we developed a stochastic description of the trainable variables $q$ which describe the weight matrix $\hat{w}(q)$ and the bias vector $b(q)$. We argued that, on short time-scales, the dynamics of the probability distribution $p(t, q)$ and of the free energy $F(t, q)$ is given by Equations (20) and (23), but on longer time-scales an approximate dynamics can be obtained using the principle of stationary entropy production. The corresponding "action" is given by (26), which can be rewritten using (22), The five terms on the right-hand side represent: (1) entropy production due to stochastic dynamics of the $q_k$'s, (2) $\gamma\,\partial^2 F/\partial q_k^2$, entropy production due to learning dynamics of the $q_k$'s, (3) $\mu\,\partial F/\partial t$, free energy production due to dynamics of the $x_i$'s, (4) $\mu\gamma\,(\partial F/\partial q_k)^2$, free energy production due to learning dynamics of the $q_k$'s, and (5) $\mu V$, the (negative of) total time-averaged free energy production.
Note that the entropy production due to stochastic dynamics is usually positive (due to the second law of thermodynamics), but the entropy production due to learning dynamics is usually negative (due to the second law of learning). While the learning entropy production is expected to dominate the dynamics far away from an equilibrium, the stochastic entropy production is expected to give the main contribution near equilibrium.
From (28), the equations of motion (27) are obtained by setting variations to zero, It is convenient to define a velocity vector and then (29) can be expressed as a Fokker-Planck equation and (30) as a Navier-Stokes equation (after differentiating with respect to $q$) Several comments are in order. First of all, the Fokker-Planck Equation (32) differs from the "stochastic" Fokker-Planck Equation (20). This is a consequence of our assumption that (20) is only valid on very short time-scales, while, according to the principle of stationary entropy production, Equations (32) and (33) must be valid on much longer time-scales. Secondly, if $\mu > 0$ then the kinematic viscosity in the Navier-Stokes Equation (33), $-\gamma/\mu$, is negative, which is a consequence of the second law of learning. Finally, if we neglect the entropy production due to learning (i.e., $\gamma\,\partial^2 F/\partial q_k^2$ in (28)), then the resulting equations of motion would be the same as (32) and (33), but with terms in boxes set to zero. These are the well-known Madelung equations that are equivalent to the Schrödinger equation for the wave-function defined as Moreover, in this limit, the action (28) takes the form of the Schrödinger action Therefore, we conclude that near equilibrium, i.e., when the first term in (28) is much larger than the second term, our system can be modeled by quantum mechanics.
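For orientation, the textbook form of the Madelung equations is recalled below. This is a sketch in standard quantum-mechanical notation, which is an assumption on our part: here $\rho$ is a probability density, $S$ is a phase, $m$ is a mass and $\hbar$ is Planck's constant, whereas in the present construction the corresponding roles are played by $p$, the free energy $F$ and combinations of $\gamma$, $\mu$ and the diffusion coefficient, with coefficients that may differ from the ones shown.

```latex
% Standard Madelung transformation (textbook notation, not the paper's):
\psi = \sqrt{\rho}\; e^{\, i S/\hbar},
\qquad
i\hbar \frac{\partial \psi}{\partial t}
  = -\frac{\hbar^{2}}{2m}\nabla^{2}\psi + V\psi .

% Real and imaginary parts give a continuity equation and a
% Hamilton-Jacobi equation with an extra "quantum potential" term:
\frac{\partial \rho}{\partial t}
  + \nabla\!\cdot\!\left(\rho\,\frac{\nabla S}{m}\right) = 0 ,
\qquad
\frac{\partial S}{\partial t}
  + \frac{\left|\nabla S\right|^{2}}{2m} + V
  - \frac{\hbar^{2}}{2m}\,\frac{\nabla^{2}\sqrt{\rho}}{\sqrt{\rho}} = 0 .
```

Dropping the last term of the second equation leaves an ordinary Hamilton-Jacobi equation, which is the structure of the classical limit discussed in the next section.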

Hamiltonian Mechanics
The next step is to consider non-equilibrium dynamics of the trainable variables, which is relevant when the second term in (28) is much larger than the first term. This corresponds to a limit when the entropy destruction is dominated by the learning dynamics and the stochastic entropy production is negligible. The corresponding Fokker-Planck equation remains the same as before (32), but the Navier-Stokes Equation (33) is greatly simplified In this limit, the dynamics of the free energy $F$ does not depend on the probability distribution $p$ and, thus, Equation (37) decouples from (32) and can be solved separately. In terms of the free energy, the equation of motion (30) is which can be thought of as a Hamilton-Jacobi equation for Hamilton's principal function $F$ and a Hamiltonian function However, note that, in classical mechanics, the Hamiltonian function only depends on the $q_k$'s and $\partial F/\partial q_k$'s, but, in our case, it also depends on one more variable ∑ k From Equations (19) and (31), we get and then (38) can be rewritten as In the limit when the entropy production (due to both learning and stochastic dynamics) is negligible, (40) and (41) can be used in order to obtain classical equations of motion which has a simple time-independent (i.e., $\partial F/\partial t = 0$) solution that is given by, where $C_0$ and the $C_k$'s are arbitrary coefficients. Note that $\partial F/\partial t = 0$ corresponds to a limit when the change in the free energy production due to dynamics of the $x_i$'s is negligible or, in other words, when the training dataset is not dynamical (as is often the case in machine learning).
The solution (44) has the exact form of the free energy for a canonical ensemble (7), with $\mu = 2\beta$ and the dynamical variables $q_i$ set to the eigenvalues $\lambda_i$ of the operator $\hat{G}$. In this limit, the average loss is where, for simplicity, we have set the mass parameter to zero, $m = 0$. This equation can be thought of as a virial theorem for our learning system where $\partial F/\partial \lambda_i$ is the "force" acting on a "particle" at position $\lambda_i$. More generally, the eigenvalues $\lambda_i$ could be arbitrary functions of the $q_i$'s and time $t$, and then and, thus, the system is Hamiltonian although the kinetic term may not be canonical.
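What it means for a function to solve a Hamilton-Jacobi equation can be checked numerically in the simplest textbook case. The sketch below is an illustration in ordinary classical mechanics and not the Hamiltonian of the learning system: it assumes a free particle with Hamiltonian $H = (\partial_q S)^2 / 2m$ and the known principal function $S(q, t) = m (q - q_0)^2 / 2t$, and the function names and sample point are hypothetical.

```python
def principal_function(q, t, q0=0.0, m=1.0):
    """Hamilton's principal function of a free particle started at q0."""
    return m * (q - q0) ** 2 / (2.0 * t)


def hamilton_jacobi_residual(S, q, t, m=1.0, h=1e-4):
    """Finite-difference estimate of dS/dt + (dS/dq)^2 / (2 m)."""
    dS_dt = (S(q, t + h) - S(q, t - h)) / (2.0 * h)
    dS_dq = (S(q + h, t) - S(q - h, t)) / (2.0 * h)
    return dS_dt + dS_dq ** 2 / (2.0 * m)


residual = hamilton_jacobi_residual(principal_function, q=1.3, t=2.0)
```

The residual vanishes up to discretization error. In the learning system the free energy $F$ plays the role of $S$, with a Hamiltonian that, as noted above, depends on one additional variable.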

Hidden Variables
We have seen that neural networks can exhibit both quantum (Section 5) and classical (Section 6) behaviors if the dynamics of the trainable variables $q$ (or equivalently of the bias vector $b$ and weight matrix $\hat{w}$) is followed explicitly, but the dynamics of the hidden variables (or the state vectors $x$) was expressed only implicitly through $\partial F/\partial t$. For this reason, it was convenient to think of the (classical) state vectors $x$ as hidden random variables whose individual dynamics was shadowed by our statistical description. In this section, we shall instead be interested in non-equilibrium dynamics of the hidden variables, which is relevant, for example, on time-scales that are much smaller than the thermalization time.
Recall that the state of the individual neurons evolves according to (1) which can be approximated to the leading order as $\bar{x}^{(0)}$ where $\hat{f}^{0} = \hat{f}'$ is the matrix of first derivatives of the activation function (9). More generally, we can consider $D$ non-interacting subsystems of state vectors (e.g., $D$ separate sets of training data), denoted by $x^{(d)}$ where $d = 1, \ldots, D$. Subsequently, the overall distribution of the state vectors is in general multimodal with $D$ local maxima, $\bar{x}^{(d)}$, and each of these maxima evolves according to and $f^{d}_{ii}$ It is convenient to define a continuous time coordinate $\tau$ such that where $\eta = \mathrm{diag}(-1, 1, \ldots, 1)$. However, in general, the minimal interactions cannot be ignored and then where the metric tensor $g^{i}_{\mu\nu}$ describes the strength of the interactions. Of course, such a description is only valid if the minimal interactions are weak, which is the assumption that we are going to make.
To estimate the dynamics of hidden variables $\bar{x}^{\mu}$, we assume that the activation function is linear, $\hat{f}^{d} = \hat{I}$ (with the slope set to one without loss of generality), and then from (49) and (50), we have and (53) becomes According to the second law of learning, it is expected that the neural network must have evolved to a network with a very low complexity, such as a network whose weight matrix is a permutation matrix, $\hat{w} = \hat{\pi}$.
For example, consider a permutation matrix with only a single cycle that (up to permutations of elements) is given by Subsequently, Equation (58) can be rewritten as If we take a continuous limit by defining $\bar{x}^{(\mu)}(\tau, \sigma)$, such that This equation has a simple solution of a periodic "right-moving" wave. In the light-cone coordinates $\xi^{\pm} \equiv \tau \pm \sigma$, the equation of motion (63) is and the constraint Equation (55) is
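The way a single-cycle permutation weight matrix produces a periodic right-moving wave can be seen in a small discrete example. The sketch below assumes a linear activation with zero bias, so that one time-step is just multiplication by the weight matrix, and it assumes a particular ordering of the cycle ($\pi_{ij} = 1$ when $j = i - 1$ modulo $N$); the network size, the initial pulse and the helper names are hypothetical.

```python
def cyclic_permutation(n):
    """Single-cycle permutation matrix with pi[i][j] = 1 when j = i - 1 (mod n)."""
    return [[1 if j == (i - 1) % n else 0 for j in range(n)] for i in range(n)]


def apply(matrix, x):
    """Matrix-vector product: one linear time-step x(t+1) = pi x(t)."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in matrix]


n = 5
pi = cyclic_permutation(n)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]   # a pulse on the first neuron

history = [x0]
for _ in range(n):
    history.append(apply(pi, history[-1]))
```

Each step moves the pulse one neuron to the right, and after $N$ steps the chain returns to its initial state, which is the discrete counterpart of a periodic right-moving wave.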

Relativistic Strings
In the last section, we have shown that an equation for a "right-moving" wave (64) can emerge in a statistical description of $D$ minimally-interacting subsystems of state vectors. A natural question is whether a "left-moving" wave can also emerge in some limit and, if so, whether the dynamics can be described in terms of relativistic strings in an emergent space-time. To answer this question, we first note that the permutation weight matrix (59) (with an arbitrary number of cycles) is such that Because the free energy (45) depends on $\hat{\pi}$ only through $\hat{G}$, the very same ensemble of the state vectors can equally likely evolve either towards $\hat{\pi}$ or towards $\hat{\pi}^{T}$. However, if the exact state of the microscopic weight matrix is unknown, then one must consider an ensemble that contains both options and then the average state vector is given by where the two terms represent statistical averages with respect to the two distributions.
Following the analysis of the previous section, the dynamics of $\bar{x}$ can be obtained from (58) for the respective weight matrices, In a continuum limit the equations are given by whose solutions represent, respectively, the right- and left-moving waves. Subsequently, the dynamics of the hidden variables (68) is indeed given by a $1 + 1$ dimensional wave equation In the light-cone coordinates, the wave equation is and the constraints The action that gives rise to the wave Equation (74) and constraints (75) is the Polyakov action that can be written in a covariant form as where $h_{ab}$ is the world-sheet metric and $h$ is its determinant. In summary, we showed that $D$ non-interacting subsystems of the state vectors $x^{(d)}$ can be described with $D + 1$ scalar fields in $1 + 1$ dimensions. Alternatively, one can view the configuration space of the scalar fields as an emergent space-time and then our system can be described with a motion of relativistic strings in $D + 1$ dimensions (76). This is very similar to what is usually done in string theory, with one major difference. Our strings arise from the dynamics of the average state vectors $\bar{x}^{(\mu)}$ and not from the dynamics of the bias vector $b$ and weight matrix $\hat{w}$ which undergo learning. Recall that the trainable variables $b$ and $\hat{w}$ (or equivalently $q$) near equilibrium can be modeled by quantum mechanics (Section 5) and further away from the equilibrium by classical mechanics (Section 6). In contrast, the state vectors $\bar{x}^{(\mu)}$ represent hidden variables of the quantum theory, but their dynamics (in certain limits) is conveniently described by relativistic strings.
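The emergence of a wave equation from an ensemble containing both a permutation matrix and its transpose can be verified on a discrete chain. The sketch below rests on assumptions: the two members of the ensemble are weighted equally, the activation is linear with zero bias, the initial profiles are arbitrary hypothetical values, and the discrete wave equation is taken in the lattice form $u(t+1, i) + u(t-1, i) = u(t, i+1) + u(t, i-1)$.

```python
def shift(x, k):
    """Cyclically shift a profile by k sites to the right (k may be negative)."""
    n = len(x)
    return [x[(i - k) % n] for i in range(n)]


def averaged_state(right0, left0, t):
    """Equal-weight average of a right-moving and a left-moving profile."""
    right = shift(right0, t)     # evolution under the permutation matrix
    left = shift(left0, -t)      # evolution under its transpose
    return [0.5 * (r + l) for r, l in zip(right, left)]


def wave_residual(right0, left0, t):
    """Largest violation of u(t+1,i) + u(t-1,i) = u(t,i+1) + u(t,i-1)."""
    n = len(right0)
    u_prev, u_now, u_next = (averaged_state(right0, left0, s)
                             for s in (t - 1, t, t + 1))
    return max(abs(u_next[i] + u_prev[i]
                   - u_now[(i + 1) % n] - u_now[(i - 1) % n])
               for i in range(n))


right0 = [0.0, 1.0, 0.5, -0.25, 0.0, 0.75]   # hypothetical initial profiles
left0 = [0.5, 0.0, -1.0, 0.25, 1.0, 0.0]
residuals = [wave_residual(right0, left0, t) for t in range(1, 8)]
```

Neither the right-moving nor the left-moving profile alone is special here: any such pair satisfies the lattice wave equation, which mirrors the statement that the averaged hidden variables obey a $1 + 1$ dimensional wave equation.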

Emergent Gravity
In Section 7, we showed that interactions between $D$ subsystems can be described by Equation (56), but up until now, the analysis was restricted to $g^{i}_{\mu\nu} = \eta_{\mu\nu}$. In this section, we shall generalize the construction to more general metric tensors $g^{i}_{\mu\nu}$, which are functions of the discrete parameter $i$ and, consequently, functions in the emergent space-time. Moreover, we shall not make any simplifying assumptions about the operator $\hat{G}$ or, equivalently, about the weight matrix $\hat{w}$ and the activation map $\hat{f}^{d}$. Subsequently, by following the procedure of the previous sections, we arrive at a discrete action for the hidden variables (or state vectors), with the corresponding equation of motion The equation of motion (78) and the corresponding action (77) can be considered as generalizations of, respectively, the wave equation (73) and the string action (76). Nevertheless, the string action can be recovered in the limit of a flat target space, $g^{i}_{\mu\nu} = \eta_{\mu\nu}$, for a permutation weight matrix, $\hat{w} = \hat{\pi}$, and for a linear activation function, $\hat{f}^{d} = \hat{I}$.
To study the dynamics in the emergent space-time, it is convenient to rewrite (77) as where $g$ is the determinant of $g_{\mu\nu}$ and is the energy-momentum tensor density. The equilibrium dynamics of neural networks was first modeled using the principle of maximum entropy with a constraint imposed on the loss function [16], but to study non-equilibrium dynamics of the trainable variables, the principle of stationary entropy production had to be used with a constraint imposed on the dynamics of free energy (25). In this section, we study non-equilibrium dynamics of the hidden variables, and so the constraint should be imposed on the action that describes the dynamics of the state vectors (79). Then, according to the principle of stationary entropy production, the quantity that must be extremized is where $\sqrt{-g}\,R(g)$ is the local entropy production density, $\kappa$ is a Lagrange multiplier, and $\bar{A}$ is a constant that represents the average of $A$. Note that the energy-momentum tensor density (80) does not depend on the metric and so varying the corresponding term in (81) with respect to the metric produces the desired result However, if we are not following the microscopic dynamics of all of the elements of the bias vector and weight matrix, then it is more useful to define where $Q$ represents the trainable variables in $q$ (or equivalently in $b$ and $\hat{w}$), which were not averaged over. Subsequently, the action (81) can be written as where $L_M(g, Q)$ plays the role of the "matter" Lagrangian and then the energy-momentum tensor should be defined as The parameter $\kappa$ is a Lagrange multiplier which imposes a "global" constraint but one can also impose the constraint "locally" by demanding that and then the total action becomes where $\Lambda$ is the "cosmological constant".
Recall that the deviations of the metric $g_{\mu\nu}(X)$ (or $g^i_{\mu\nu}$) from the flat metric $\eta_{\mu\nu}$ represent local interactions between subsystems (56). Therefore, if our system is in the process of equilibration, then the entropy production should be a local function of the metric tensor. Using a phenomenological approach due to Onsager [23], we can expand the entropy production around equilibrium [24],

$$\sqrt{-g}\,R = \sqrt{-g}\,L^{\mu\nu\,\alpha\beta\,\gamma\delta}\,g_{\alpha\beta,\mu}\,g_{\gamma\delta,\nu}.$$
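To make the index structure of this expansion concrete, here is a minimal numerical sketch of the quadratic form at a single space-time point. The six-index tensor used here, built from three copies of the flat metric, is a hypothetical stand-in chosen only for illustration and is not the Onsager tensor of Equation (93). The function name `entropy_production_density` and the array layouts are likewise assumptions.

```python
import numpy as np

def entropy_production_density(L, g, dg):
    """Evaluate sqrt(-g) * L^{mu nu, ab, cd} g_{ab,mu} g_{cd,nu} at one point.

    L  : array (n, n, n, n, n, n), indices ordered (mu, nu, a, b, c, d)
    g  : array (n, n), metric g_{ab} at the point
    dg : array (n, n, n), dg[mu, a, b] = d g_{ab} / d X^mu at the point
    """
    sqrt_minus_g = np.sqrt(-np.linalg.det(g))
    return sqrt_minus_g * np.einsum("mnabcd,mab,ncd->", L, dg, dg)

n = 2                                  # 1 + 1 dimensions, for brevity
eta = np.diag([-1.0, 1.0])             # flat metric (equal to its own inverse)

# Hypothetical, illustrative tensor: L^{mu nu, ab, cd} = eta^{mu nu} eta^{ac} eta^{bd}.
L_toy = np.einsum("mn,ac,bd->mnabcd", eta, eta, eta)

flat = np.zeros((n, n, n))             # no metric gradients: equilibrium
time_grad = np.zeros((n, n, n))
time_grad[0, 1, 1] = 1.0               # g_11 varies along the time direction
space_grad = np.zeros((n, n, n))
space_grad[1, 1, 1] = 1.0              # g_11 varies along the space direction

for name, dg in [("flat", flat), ("time", time_grad), ("space", space_grad)]:
    print(name, entropy_production_density(L_toy, eta, dg))
```

With this toy tensor the quadratic form vanishes for a constant metric and takes opposite signs for time-like and space-like gradients, so it is indefinite. A tensor with Lorentzian index structure is therefore generally not positive definite, which is the situation the text associates with the second law of learning rather than the second law of thermodynamics.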
After integrating by parts, neglecting boundary terms and collecting all other terms, we get [...] and [...]. Thus, upon varying (81) with respect to the metric, we get the Einstein equations (95), where the Ricci tensor is defined as usual. Note that, according to definition (93), the Onsager tensor need not be positive definite, which would be inconsistent with the second law of thermodynamics but is permitted by the second law of learning. It is important to highlight that the Einstein equations (95) were obtained from a particular form of the Onsager tensor (93), which is very simple and also highly symmetric. In this respect the result is phenomenological, but one might wonder whether the symmetries of the Onsager tensor can also be derived from first principles. Moreover, since neural networks can exhibit approximate behavior described by quantum mechanics (see Section 5), it would be interesting to see whether the symmetries of quantum field theories (such as the standard model) might also emerge from the learning dynamics of a microscopic neural network. Note that the emergence of symmetries is extremely important not only for modeling physical systems with neural networks, but also for designing more efficient artificial neural networks (see Ref. [16]). In this paper, we only considered the emergence of the quantum phase (i.e., U(1) symmetry in Section 5) and the emergence of space-time (i.e., SO(1, D) symmetry in Section 8), and we leave the emergence of more general symmetries for future studies.
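For reference, the "usual" definitions invoked above can be written in a common textbook convention as follows. This is a sketch of the standard forms only: the sign conventions and the normalization of $\kappa$ and $\Lambda$ are assumptions and may differ from those of Equation (95).

```latex
% Christoffel symbols of the metric g_{\mu\nu}
\Gamma^{\alpha}_{\mu\nu}
  = \tfrac{1}{2}\, g^{\alpha\beta}
    \left( \partial_{\mu} g_{\beta\nu} + \partial_{\nu} g_{\beta\mu}
           - \partial_{\beta} g_{\mu\nu} \right)

% Ricci tensor and Ricci scalar
R_{\mu\nu}
  = \partial_{\alpha} \Gamma^{\alpha}_{\mu\nu}
  - \partial_{\nu} \Gamma^{\alpha}_{\mu\alpha}
  + \Gamma^{\alpha}_{\alpha\beta} \Gamma^{\beta}_{\mu\nu}
  - \Gamma^{\alpha}_{\nu\beta} \Gamma^{\beta}_{\mu\alpha},
\qquad
R = g^{\mu\nu} R_{\mu\nu}

% Einstein equations with a cosmological constant
R_{\mu\nu} - \tfrac{1}{2}\, R\, g_{\mu\nu} + \Lambda\, g_{\mu\nu}
  = \kappa\, T_{\mu\nu}
```

In this reading, the Lagrange multiplier $\kappa$ of the constrained extremization plays the role of the gravitational coupling, and the constant $\Lambda$ arising from the local constraint plays the role of the cosmological constant.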

Holography
In the preceding sections, we applied the principle of stationary entropy production to study the dynamics of neural networks in two different limits. In the first limit, the trainable variables $q$ were treated stochastically, but their dynamics was constrained by the hidden variables $x$ through the free energy $F$. The resulting dynamics was shown to exhibit quantum and classical behaviors described by the functional $S_q[p, F]$ (see (28)). In the second limit, the hidden variables $x$ were treated stochastically, but their dynamics was constrained by the trainable variables $q$ through the action $A$. The resulting dynamics was shown to exhibit behavior described by the action of a gravitational metric theory, such as general relativity, $S_x[g, Q]$ (see (88)). The two limits are certainly very different: the "gravitational" theory describes very sparse and deep neural networks, whereas in the "quantum" theory the network can be very dense and shallow. However, one might wonder whether it is possible to map the sparse and deep neural network to the dense and shallow neural network without losing the ability of the network to learn. If the answer is affirmative, then the two descriptions, quantum and gravitational (or dense and sparse, or shallow and deep), are dual, and either one can be used to describe the learning dynamics.
In this section, we explore the idea that the duality not only exists, but is also holographic in the sense that the degrees of freedom of the gravitational theory, i.e., $x$, $b$ and $\hat{w}$, can be mapped to only the boundary degrees of freedom of the quantum theory, i.e., $x^{\partial}$, $b^{\partial}$ and $\hat{w}^{\partial}$. The non-equilibrium dynamics of both systems is governed by the principle of stationary entropy production, and to justify such a mapping the entropy production of the gravitational system, $\Delta S_x$, should correspond to the entropy production of the quantum system, $\Delta S^{\partial}_q$. Roughly speaking, this means that the uncertainty in the position of neurons in the bulk, $x$, should correspond to the uncertainty in the values of the quantum variables on the boundary, i.e., $b^{\partial}$ and $\hat{w}^{\partial}$. For example, consider a mapping that is defined by [...]. In a microscopic picture, the gravitational system consists of long chains of neurons (see Section 7) connecting different pairs of boundary neurons, $i$ and $j$, and the length of these chains is encoded in the elements of the boundary weight matrix: the smaller the element $w^{\partial}_{ij}$, the larger the number of intermediate bulk neurons that connect $i$ to $j$. Whenever any two chains of neurons $i$-$j$ and $k$-$l$ intersect and form two other chains, $i$-$l$ and $k$-$j$, the entropy of the bulk theory changes. On the other side of the duality, the same event changes the corresponding elements $w^{\partial}_{ij}$, $w^{\partial}_{kl}$, $w^{\partial}_{kj}$ and $w^{\partial}_{il}$ or, in other words, produces entropy in the boundary theory. Thus, it is not unreasonable to expect that the entropy production in the two systems is related.
The holographic duality can be formulated more precisely by considering the action functionals that determine the dynamics in both theories. In the boundary theory, the action $S_q[p(q^{\partial}), F(q^{\partial})]$ is given by Equation (28), and in the bulk theory the action $S_x[g(X), Q(X)]$ is given by Equation (88). For the two systems to be dual, the two actions must be proportional [...] or, using (28) and (88), [...]. The left-hand side describes the bulk gravitational theory, the right-hand side describes the boundary theory, and the duality transformation is nothing but a change of variables between $(g, Q)$ and $(p, F)$. Note, however, that the boundary theory can only be approximated by quantum mechanics in the limit when the entropy production due to learning (i.e., the quantity in the box in (102)) is subdominant. Therefore, the holography described by (101) should be considered more general than the holography discussed, for example, in the context of the AdS/CFT correspondence, where the CFT side is quantum and the AdS side is gravitational.

Discussion
In this paper, we discussed a possibility that the entire universe on its most fundamental level is a neural network. This is a very bold claim. We are not just saying that artificial neural networks can be useful for analyzing physical systems [25] or for discovering physical laws [26]; we are saying that this is how the world around us actually works. In this respect it could be considered a proposal for the theory of everything, and as such it should be easy to prove wrong. All that is needed is to find a physical phenomenon that cannot be described by neural networks. Unfortunately (or fortunately), that is easier said than done. It turns out that the dynamics of neural networks is so complex that one can only understand it in very specific limits. The main objective of this paper was to describe the behavior of neural networks in the limits when the relevant degrees of freedom (such as the bias vector, weight matrix and state vector of neurons) can be modeled as stochastic variables that undergo a learning evolution. In this section, we briefly discuss the main results and their implications for a possible emergence of quantum mechanics, general relativity, and macroscopic observers from a microscopic neural network.
Emergent quantum mechanics is a relatively new [27,28] but rapidly evolving field [20][29][30][31][32][33], which is based on a set of very old ideas dating back to the works of de Broglie and Bohm. The de Broglie-Bohm theory (also known as pilot wave theory or Bohmian mechanics) was originally formulated in terms of non-local hidden variables [12], which makes it an easy target. The main new insight is that quantum mechanics may not be a fundamental theory, but only a mathematical tool that allows one to carry out statistical calculations in certain dynamical systems. If this is correct, then one should be able to derive all of the essential ingredients (complex wave-function, Schrödinger equation, etc.) from first principles. In this paper, we did exactly that for a dynamical system of a neural network which contains two different types of degrees of freedom: trainable (e.g., bias vector and weight matrix) and hidden (e.g., state vector of neurons). What we showed is that the dynamics of the trainable variables near equilibrium is described by Madelung (or, equivalently, Schrödinger) equations with the free energy (for a canonical ensemble of hidden variables) representing the quantum phase (see Section 5), and that further away from equilibrium their dynamics is described by Hamilton-Jacobi equations with the free energy representing Hamilton's principal function (see Section 6). This demonstrates that neural networks can indeed exhibit emergent quantum and also classical behaviors. It is important to emphasize that the learning dynamics was essential: the stochastic dynamics alone would not have produced the desired result.
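As a reminder of why the Madelung and Schrödinger descriptions are equivalent, and of how the Hamilton-Jacobi equation appears as a limit, the standard textbook form of the Madelung transformation is sketched below for a single particle of mass $m$ in a potential $V$. The symbols $\rho$, $S$, $m$, $V$ and $\hbar$ are the textbook ones and are assumptions here. In the neural-network setting of Sections 5 and 6, the role of the density is played by the distribution $p$ of trainable variables and the role of the phase by the free energy $F$, with constants fixed by the learning dynamics rather than by $\hbar$ and $m$.

```latex
% Polar (Madelung) form of the wave-function
\psi = \sqrt{\rho}\, e^{\, i S / \hbar}

% Substituting into  i\hbar\,\partial_t \psi = -\frac{\hbar^2}{2m}\nabla^2\psi + V\psi
% and separating imaginary and real parts gives:

% (i) continuity equation for the density
\partial_t \rho + \nabla \cdot \left( \rho\, \frac{\nabla S}{m} \right) = 0

% (ii) Hamilton-Jacobi equation with an additional "quantum potential"
\partial_t S + \frac{|\nabla S|^2}{2m} + V
  - \frac{\hbar^2}{2m}\, \frac{\nabla^2 \sqrt{\rho}}{\sqrt{\rho}} = 0
```

When the last term in (ii) is negligible, the second equation reduces to the classical Hamilton-Jacobi equation with $S$ as Hamilton's principal function, which mirrors the transition between the two regimes described in the text.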
Emergent (or entropic) gravity is also a relatively new field [7][8][9], but it is far less clear if or when progress is being made. The main problem is that emergent gravity is not just about gravity, but is also about emergent space [17][34][35][36], emergent Lorentz invariance [37][38][39][40][41], emergent general relativity [24,42,43], etc. Quite remarkably, neural networks open up a new avenue to address all of these problems in the context of the learning dynamics. It turns out that a dynamical space-time can indeed emerge from a non-equilibrium evolution of the hidden variables (i.e., state vector of neurons) in a manner that is very similar to string theory. In particular, if one considers D subsystems that interact minimally (through the bias vector and weight matrix), with average state vectors $\bar{x}^1, \ldots, \bar{x}^D$ (and the total average state vector $\bar{x}^0$), then the dynamics of $\bar{x}^{\mu}$ can be modeled with relativistic strings in an emergent D + 1 dimensional space-time (see Sections 7 and 8), and if the interactions are described by a metric tensor, then the dynamics can be modeled with Einstein equations (see Section 9). Once again, not only the stochastic but also the learning dynamics was essential for the equilibration of the emergent space-time to exhibit the behavior of a gravitational theory such as general relativity. This demonstrates that the dynamics of a neural network in the appropriate limits can be approximated by both emergent quantum mechanics and emergent general relativity, but the two limits are very different. The gravitational theory describes very sparse and deep neural networks, whereas in the quantum theory the neural network can be very dense and shallow. However, it is possible that there exists a holographic duality map from the bulk neurons of the deep and sparse network to the boundary neurons of the shallow and dense network (see Section 10).
We now come to one of the most controversial questions: how can macroscopic observers emerge in a physical system? The question is extremely important not only for settling some philosophical debates, but also for understanding the results of real physical experiments [12] and cosmological observations [13]. As was already mentioned, our current understanding of fundamental physics does not allow us to formulate a self-consistent and paradox-free definition of observers, and the possibility that observers are an emergent phenomenon is certainly worth considering. Indeed, if both quantum mechanics and general relativity are not fundamental but emergent phenomena, then why can macroscopic observers not also emerge in some way from a microscopic neural network? Of course, this is a much more difficult task and we are not going to resolve it completely, but we shall mention an old idea that might be relevant here: the principle of natural selection. We are not talking about cosmological natural selection [44], but about the good old biological natural selection [45], although the two might actually be related. Indeed, if the entire universe is a neural network, then something like natural selection might be happening on all scales, from cosmological ($> 10^{+15}$ m) and biological ($10^{+2}$ to $10^{-6}$ m) all the way to subatomic ($< 10^{-15}$ m) scales. The main idea is that some local structures (or architectures) of neural networks are more stable against external perturbations (i.e., interactions with the rest of the network) than other local structures. As a result, the more stable structures are more likely to survive and the less stable structures are more likely to be exterminated. There is no reason to expect that this process would stop at a fixed time or be confined to a fixed scale, and so the evolution must continue indefinitely and on all scales.
We have already seen that, on the smallest scales, the learning evolution is likely to produce structures of a very low complexity (i.e., second law of learning), such as one dimensional chains of neurons, but this might just be the beginning. As the learning progresses these chains can chop off loops, form junctions and according to natural selection the more stable structures would survive. If correct, then what we now call atoms and particles might actually be the outcomes of a long evolution starting from some very low complexity structures and what we now call macroscopic observers and biological cells might be the outcome of an even longer evolution. Of course, at present, the claim that natural selection may be relevant on all scales is very speculative, but it seems that neural networks do offer an interesting new perspective on the problem of observers.
Funding: This research received no external funding.