Entropic Dynamics for Learning in Neural Networks and the Renormalization Group †

Abstract: We study the dynamics of information processing in the continuous depth limit of deep feed-forward Neural Networks (NN) and find that it can be described in language similar to the Renormalization Group (RG). The association of concepts to patterns by a NN is analogous to the identification of the few variables that characterize the thermodynamic state obtained by the RG from microstates. We encode the information about the weights of a NN in a Maxent family of distributions. The location hyper-parameters represent the weight estimates. Bayesian learning of a new example determines new constraints on the generators of the family, yielding a new pdf, and in the ensuing entropic dynamics of learning, hyper-parameters change along the gradient of the evidence. For a feed-forward architecture the evidence can be written recursively from the evidence up to the previous layer convolved with an aggregation kernel. The continuum limit leads to a diffusion-like PDE analogous to Wilson's RG but with an aggregation kernel that depends on the weights of the NN, different from those that integrate out ultraviolet degrees of freedom. Approximations to the evidence can be obtained from solutions of the RG equation. Its derivatives with respect to the hyper-parameters generate examples of Entropic Dynamics in Neural Networks Architectures (EDNNA) learning algorithms. For simple architectures, these algorithms can be shown to yield optimal generalization in student-teacher scenarios.


Introduction
Neural networks are information processing systems that learn from examples. Loosely inspired by biological neural systems, they have been used for several types of problems such as classification, regression, dimensionality reduction and clustering [1]. Selection in biological systems is based on a measure of performance that combines not only accuracy but also ease of computation and implementation. Predictions based on expectations over Bayesian posterior distributions may saturate the bounds for optimal accuracy, but will typically lack ease of computation and speed in reaching a result. Neural networks are parametric models, and if we do not address the determination of the architecture, which we do not in this paper, the problem of learning from examples is reduced to obtaining fast estimates of the weights or parameters while avoiding integration over high-dimensional spaces. The spectacular explosion of applications in several areas is witness to the fact that several training methods and large data sets are available. Despite these victories, the mechanisms of information processing dynamics remain obscure, and despite several decades of theoretical analysis using methods of Statistical Mechanics, much remains to be understood. Here we study on-line learning in feed-forward architectures, where (input, output) examples are presented one at a time. Theoretical analysis is easier than for batch or off-line learning, where the cost function depends on a large number of example pairs, yet on-line accuracy remains high. This is in part because the cost function changes from example to example, so the local minima of the cost function that plague off-line learning are less important. Local stationary points of the learning dynamics are still a problem, but good performance is possible. An important problem to be addressed is which cost function is the most appropriate.
If an algorithm is going to be successful it has to approach Bayesian estimates for the available information. But any Bayes algorithm leads to integrals of very high dimension, even in the millions. Monte Carlo strategies cannot be used if simplicity is a requirement. The strategy to determine optimized algorithms for on-line learning has been studied in the past for restricted scenarios and architectures. We present a more general approach, with the following strategy. We are in a situation of incomplete information, thus a probability distribution represents, at a given point in the dynamics, what is known about the parameters. We have to commit to a family of distributions and we choose a Maxent family. Location hyperparameters give the current estimate of the weights. A new (input, output) example pair arrives and Bayes rule permits an update. The choice of the likelihood is a reflection of what we know about the architecture of the NN. In general it is not conjugate to the chosen family.
Still, the Bayes posterior, while not in the family, points to a unique member of the family, since it imposes new constraints on the expected values of the generators.
The resulting learning algorithm is the entropic dynamics induced by the arrival of information in the examples, which changes the hyperparameters of the family. It turns out that the weights change in the direction of increasing Bayesian evidence of the model: the algorithm is a stochastic gradient descent whose cost function is the negative log evidence of the model.
The denominator of the Bayes update can be interpreted either as the evidence of the model or, alternatively, as the predictive probability distribution of the output conditioned on the input and the weights. Once it is written as the marginalization, over the internal representation (i.e., the activation values of the internal units), of the joint distribution of activities of the whole network, and under the supposition that information flows only from one layer to the next, a Markov chain structure follows. Recursion relations for the partial evidence up to a given internal layer are obtained, and in the continuous depth limit (CDL) they yield a Fokker-Planck parabolic partial differential equation. It generalizes Wilson's Renormalization Group [2] diffusion equation to general kernels. The usual rules that eliminate high frequency degrees of freedom, e.g., the majority rule, are replaced by the weights of the NN. The RG dynamics can be seen as a classifier of Statistical Mechanics microstates into thermodynamic states. A NN extracts the relevant degrees of freedom that describe the macroscopic concept to which an input pattern is to be assigned. The first authors to relate the RG and NN were [3] and [4], generating a large flow of ideas about the possible connections between these two areas [5][6][7].

Maxent Distributions and Bayesian Learning
Let $f_a(w)$, for $a = 1, \dots, K$ and $w \in \mathbb{R}^N$, be the generators of a family $\mathcal{Q}$ of distributions $Q(w|\lambda)$. If information about $w$ is given in the form of constraints $\mathbb{E}_Q(f_a) = F_a$, for the set of numbers $\{F_a\}_{a=1,\dots,K}$, the Maxent distribution is
$$Q(w|\lambda) = \frac{1}{z} \exp\Big(-\sum_{a=1}^{K} \lambda_a f_a(w)\Big), \qquad (1)$$
where $z$ ensures normalization. Then
$$F_a = -\frac{\partial \log z}{\partial \lambda_a}. \qquad (2)$$
Now consider a system learning a map from inputs $x$ to outputs $y$, where the model is a known function which depends on a parameter array $w$: $y = T(x; w)$. The aim of learning is to obtain the parameters from the information in the learning set $D_n = \{(x_i, y_i)\}_{i=1,\dots,n}$. We want to obtain a distribution for the parameters and consider that up to $n-1$ examples the information is coded in a member of the $\mathcal{Q}$ family: $Q(w|\lambda_{n-1}) = Q_{n-1}$. Calling the likelihood of the problem $L_n = P(y_n|x_n, w)$, the product rule permits the Bayesian updating
$$P(w|y_n, x_n, \lambda_{n-1}) = \frac{Q_{n-1} L_n}{Z(y_n|x_n, \lambda_{n-1})}, \qquad (3)$$
where the partition function or evidence is $Z(y_n|x_n, \lambda_{n-1}) = \int Q_{n-1} L_n \, dw = P(y_n|x_n, \lambda_{n-1})$. The Bayes posterior given by Equation (3) in general does not belong to the $\mathcal{Q}$ family. We have to choose the member of the family that is closest to the Bayes posterior. This is the Maxent posterior. The way to proceed is based on the fact that a member of the $\mathcal{Q}$ family is determined solely by the values of the constraints $\{F_a\}$. The Bayes posterior defines a set of values $\{\langle f_a \rangle\}$ for the constraints. It points in a unique way to the Maxent posterior $Q_n$ within the family $\mathcal{Q}$, obtained at the extremum of the entropy subject to the only possible constraints on its expected values $\mathbb{E}_n(f_a)$, which are taken to be the Bayes posterior expected values $\langle f_a \rangle$. Then for every generator
$$F_a^{n} = \mathbb{E}_n(f_a) = \frac{1}{Z} \int f_a(w)\, Q_{n-1} L_n \, dw.$$
Subtract $F_a^{n-1}$ from both sides and use Equation (2), which gives $\partial Q_{n-1} / \partial \lambda_a^{n-1} = (F_a^{n-1} - f_a)\, Q_{n-1}$; then
$$F_a^{n} - F_a^{n-1} = -\frac{\partial \log Z}{\partial \lambda_a^{n-1}},$$
since the likelihood is independent of the Lagrange multipliers. This learning dynamics is deduced from entropy maximization and thus will be called Entropic Dynamics. Learning occurs along the gradient of the log evidence. It will turn out that the sign is such that typically the evidence for the new model is higher than before learning.
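The statement that the change in the constraints equals minus the gradient of the log evidence with respect to the Lagrange multipliers can be checked numerically. The following is a minimal sketch under assumptions that are not in the text: a single weight $w$ on a grid, generators $f_1 = w$ and $f_2 = w^2$, and a hypothetical sigmoid likelihood. All function names are illustrative.

```python
import numpy as np

# One-dimensional weight on a grid (an assumption made for illustration only).
w = np.linspace(-10.0, 10.0, 4001)
dw = w[1] - w[0]
generators = [w, w**2]          # f_1 = w, f_2 = w^2

def prior(lam):
    """Maxent member Q(w|lambda) proportional to exp(-lam_1 w - lam_2 w^2)."""
    unnorm = np.exp(-lam[0] * w - lam[1] * w**2)
    return unnorm / (unnorm.sum() * dw)

def likelihood():
    """Hypothetical likelihood of one example: a sigmoid in w."""
    return 1.0 / (1.0 + np.exp(-3.0 * w))

def log_evidence(lam):
    """log Z = log of the integral of Q(w|lambda) L(w) over w."""
    return np.log((prior(lam) * likelihood()).sum() * dw)

def constraint_shift(lam):
    """Posterior minus prior expectation of every generator."""
    q = prior(lam)
    post = q * likelihood()
    post /= post.sum() * dw
    return np.array([((post - q) * f).sum() * dw for f in generators])

def minus_grad_log_evidence(lam, eps=1e-5):
    """Central finite-difference estimate of -d log Z / d lambda_a."""
    grad = np.zeros(len(lam))
    for a in range(len(lam)):
        step = np.zeros(len(lam))
        step[a] = eps
        grad[a] = -(log_evidence(lam + step) - log_evidence(lam - step)) / (2 * eps)
    return grad
```

Calling `constraint_shift(np.array([0.3, 0.8]))` and `minus_grad_log_evidence(np.array([0.3, 0.8]))` should return the same two numbers up to finite-difference error, whatever likelihood is substituted, as long as it does not depend on the multipliers.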
These equations hold for any family, but it is interesting to consider the case most likely to be useful in practice, where the family is determined by the functions $f_0 = 1$, $f_i = w_i$ and $f_{ij} = w_i w_j$, for $i, j = 1, \dots, N$. The constraints after $n$ examples are the normalization, $\mathbb{E}(w_i) = \hat{w}_{ni}$ and $\mathbb{E}(w_i w_j) = (C_n)_{ij} + \hat{w}_{ni} \hat{w}_{nj}$. The result is the Gaussian family $Q \propto \exp(-\lambda_0 - \sum_i \lambda_i w_i - \sum_{ij} \lambda_{ij} w_i w_j)$. The entropic dynamics update equations, driven by the arrival of the $n$th example, are
$$\hat{w}_n = \hat{w}_{n-1} + C_{n-1} \nabla_{\hat{w}_{n-1}} \log Z_n, \qquad (7)$$
$$C_n = C_{n-1} + C_{n-1} \left( \nabla^2_{\hat{w}_{n-1}} \log Z_n \right) C_{n-1}. \qquad (8)$$
For a layered network, these are the equations associated with the update of the weights afferent to a particular unit in layer $d$ from unit $i$ in layer $d-1$, and of the component of the covariance matrix describing the correlation between weights coming from units $i$ and $j$. The update equations, induced by a maximum entropy approximation to Bayesian learning, are the learning algorithm of the neural network which implements the map $y = T(x; \hat{w})$. An approximation to this scheme was found for simple networks with no hidden units using a variational procedure [8] and applied to several architectures [9][10][11][12][13]. Then Opper [14] showed the Bayesian connection, explored elsewhere [15]. Recently it has been applied to societies of interacting neural networks [16][17][18][19]. While [12] attacked the neural network with a hidden layer, the challenge remains to study networks with deep architectures.
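For a network with no hidden units the evidence has a closed form, which makes the Gaussian update concrete. The sketch below assumes a noiseless sign perceptron with likelihood $\Theta(y\, w \cdot x)$ and $y = \pm 1$, a choice made here for illustration and not taken from the text. Under that assumption $Z = \Phi(u)$ with $u = y\, \hat{w} \cdot x / \sqrt{x^T C x}$, where $\Phi$ is the standard normal cumulative distribution. The function names are hypothetical.

```python
import math
import numpy as np

def _phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def _Phi(u):
    """Standard normal cumulative distribution."""
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def log_evidence(w_hat, C, x, y):
    """log Z for the assumed step likelihood Theta(y w.x) under a Gaussian N(w_hat, C)."""
    sigma = math.sqrt(x @ C @ x)
    return math.log(_Phi(y * (w_hat @ x) / sigma))

def ed_update(w_hat, C, x, y):
    """One on-line step: mean moves along C grad log Z, covariance along C (Hessian log Z) C."""
    sigma = math.sqrt(x @ C @ x)
    u = y * (w_hat @ x) / sigma
    r = _phi(u) / _Phi(u)                          # d log Phi / du
    grad = (r * y / sigma) * x                     # gradient of log Z w.r.t. w_hat
    hess = -(r * (u + r) / sigma**2) * np.outer(x, x)   # Hessian of log Z w.r.t. w_hat
    w_new = w_hat + C @ grad
    C_new = C + C @ hess @ C
    return w_new, C_new
```

Starting from `w_hat = np.zeros(N)` and `C = np.eye(N)` and calling `ed_update` once per example gives an on-line learning loop. In this toy setting each step raises the log evidence of the example just presented and shrinks the variance along the input direction.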

Deep Multilayer Perceptron
In this section we show that the evidence for a multilayer feedforward neural network can be written recursively as a map. Actually we will get two maps that are essentially the same. This type of map is typical of Renormalization Group transformations and in a continuous limit representation of the neural network as a field theory, we will show that the map leads to a partial differential equation analogous to Wilson's diffusion-like RG equation.
We fix our attention on the $n$th example, and hence no longer write temporal (lower) indices. A layer (upper) index now appears and $x^d$ is the internal representation at layer $d$. Layers start with $d = 0$ and the depth of the network is $D$. Layer $d$ weights are collectively denoted $w^d$, and individually $w^d_{ij}$ is the weight connecting unit $i$ at layer $d-1$ to unit $j$ at layer $d$. The data pair used for the learning step is $X^0$ and $y$. The distributions of the representation are $\delta(x^0 - X^0)$ at the input and $\delta(x^D - y)$ at the output. The partition function $Z(y_n|x_n, \lambda_{n-1})$ in Equation (3) is $Z(X^D|x^0, \lambda) = \int Q(w|\lambda) L \, dw$, where $Q(w|\lambda)$ is the prior joint distribution of the weights over all the layers. We will eventually take this to be a product over layers, $Q(w|\lambda) = \prod_{d=1}^{D} Q(w^d|\lambda^d)$, which will permit a simpler analytical treatment, but it is not a necessity at this moment. To obtain the likelihood we marginalize the joint distribution of the internal representations $P(x^D, x^{D-1}, \dots, x^1|x^0, w^1, \dots, w^D)$ over all internal representations at the hidden units, using the same trick that leads to the Chapman-Kolmogorov equation. Since information flows only from one layer to the next, the joint distribution factorizes and
$$L = P(x^D|x^0, w) = \int \prod_{d=1}^{D-1} dx^d \prod_{d=1}^{D} P(x^d|x^{d-1}, w^d).$$
The evidence can be written as
$$Z = \int \prod_{d=1}^{D-1} dx^d \; Q(x^D, \dots, x^1|x^0, \lambda),$$
where
$$Q(x^D, \dots, x^1|x^0, \lambda) = \int dw \, Q(w|\lambda) \prod_{d=1}^{D} P(x^d|x^{d-1}, w^d)$$
is the joint transition distribution. Define the partially integrated $Z_d$ for any $d = 1, \dots, D$,
$$Z_d(x^D, \dots, x^d|x^0, \lambda) = \int \prod_{d'=1}^{d-1} dx^{d'} \; Q(x^D, \dots, x^1|x^0, \lambda).$$
It satisfies the recursion
$$Z_{d+1} = \int dx^d \, Z_d, \qquad (13)$$
and the evidence is $Z = Z_D$. At this point this is analogous to a Statistical Mechanics (SM) or Euclidean field theory (EFT) partition function in which all field configurations with momentum components above a cutoff have been integrated out. The equivalent of the effective action of the EFT, or the renormalized Hamiltonian in SM, is $-\log Z_d$.
Since the prior is also a product, the partition function becomes
$$Z = \int \prod_{d=0}^{D} dx^d \; \delta(x^0 - X^0)\, \delta(x^D - y) \prod_{d=1}^{D} \int dw^d \, Q(w^d|\lambda^d)\, P(x^d|x^{d-1}, w^d),$$
where we integrate over $x^0$ and $x^D$ with the constraints that their distributions are deltas at the input $X^0$ and output $y$.
Define the evidence up to a given layer, $\rho_d(x^d)$. The last step for the map of a network of depth $D$ is for $x^D = y$, leading to the evidence of the model defined by the architecture of the network with weights and hyperparameters given by the set of $\lambda^d$: $Z = \rho_D(y)$. Define a layer-to-layer transition distribution
$$Q_T(x^d|x^{d-1}, \lambda^d) = \int dw^d \, Q(w^d|\lambda^d)\, P(x^d|x^{d-1}, w^d);$$
then we have a map that gives the evidence after $d$ layers as an integral over internal representations at layer $d-1$ of the evidence at layer $d-1$ with a kernel $Q_T$ that implements an aggregation RG-like step:
$$\rho_d(x^d) = \int dx^{d-1} \, Q_T(x^d|x^{d-1}, \lambda^d)\, \rho_{d-1}(x^{d-1}). \qquad (20)$$
We have obtained two RG-like maps, Equations (13) and (20). $Z_d$ depends on all internal representations from layer $d$ to $D$ and on all the hyperparameters $\lambda$. The simpler $\rho_d$ only depends on the internal representation at layer $d$ and on the hyperparameters of the previous layers. The map for $Z_d$ is simpler, while the map for $\rho_d$ requires, at each step, the transition distribution $Q_T(x^d|x^{d-1}, \lambda^d)$ as input.
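The layer-to-layer map for the evidence is a Markov chain propagation, which is easy to see when the internal representation is restricted to a finite number of states. The sketch below makes that restriction as an illustrative assumption: each layer's transition distribution becomes a column-stochastic matrix (random here, not derived from any weight distribution), and the recursive map is compared with a brute-force sum over all hidden paths. All names are hypothetical.

```python
import itertools
import numpy as np

def random_transition(rng, k):
    """Hypothetical kernel Q_T[new_state, old_state]; every column sums to one."""
    m = rng.random((k, k))
    return m / m.sum(axis=0, keepdims=True)

def propagate_evidence(transitions, x0_index):
    """Apply rho_d = Q_T rho_{d-1} layer by layer, starting from a delta at the input."""
    k = transitions[0].shape[0]
    rho = np.zeros(k)
    rho[x0_index] = 1.0
    history = [rho]
    for q_t in transitions:
        rho = q_t @ rho          # sum over the representation at the previous layer
        history.append(rho)
    return history

def brute_force_evidence(transitions, x0_index, y_index):
    """Sum the joint transition probability over every hidden-layer path."""
    k = transitions[0].shape[0]
    depth = len(transitions)
    total = 0.0
    for hidden in itertools.product(range(k), repeat=depth - 1):
        path = (x0_index,) + hidden + (y_index,)
        p = 1.0
        for d, q_t in enumerate(transitions):
            p *= q_t[path[d + 1], path[d]]
        total += p
    return total
```

The last entry of `propagate_evidence(...)`, read at the index of the observed output, plays the role of the evidence of the example. It agrees with `brute_force_evidence`, but the recursion costs one matrix-vector product per layer instead of a sum over all paths.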
The transition distribution describes the renormalization group like transformation implemented by the neural network that takes the internal representation at one layer to the next. It is simple to see that

Generalized RG Differential Equation of a Neural Network in the Continuous Depth Limit
The layer index is obviously discrete, but we can take the continuous limit, where layers are now represented by a time-like variable $\tau$. A discrete variable $i$ still labels the units. The evidence at depth $\tau$ is related to the evidence at depth $\tau_0$ by a generalization of Equation (20):
$$\rho(x, \tau) = \int Dx' \; Q_T(x(\tau)|x'(\tau_0), \lambda)\, \rho(x', \tau_0), \qquad (22)$$
where the integration measure is $Dx = \prod_i dx_i$. The distribution $Q_T(x(\tau)|x'(\tau_0), \lambda)$ is the probability that a network with parameters $\lambda$, conditional on being in state $x'$ at $\tau_0$, has an internal representation $x$ at depth $\tau$. It must satisfy the composition law
$$Q_T(x(\tau)|x''(\tau_0), \lambda) = \int Dx' \; Q_T(x(\tau)|x'(\tau'), \lambda)\, Q_T(x'(\tau')|x''(\tau_0), \lambda), \qquad \tau_0 < \tau' < \tau.$$
For a deterministic neural network, conditional on the weights $w$, the evolution of the internal representation is given by the transfer function. To obtain a well-behaved limit, the internal representation is supposed to vary slowly with depth, so that its rate of change $\tilde{b}$ is interpreted as the gradient of the transfer function. The transition distribution is obtained by integrating over all configurations of the weights in the slice. We have chosen a Gaussian family to represent the informational state of the network, which now takes the form of a product of Gaussians over all $\tau$ slices:
$$Q(w|\lambda) = \prod_{\tau} \frac{1}{\sqrt{\det(2\pi C_\tau)}} \exp\Big(-\tfrac{1}{2} \Delta w^{T} C_\tau^{-1} \Delta w\Big),$$
where $\Delta w = w - \hat{w}_\tau$ and $\lambda = \{\hat{w}_\tau, C_\tau\}$ for all values of $\tau$, but only the hyperparameters of the particular slice under consideration matter. To define the continuous limit we impose that the limits below exist:
$$b_i(x', \tau, \lambda) = \lim_{\Delta\tau \to 0} \frac{1}{\Delta\tau} \int Dx \, (x_i - x'_i)\, Q_T(x(\tau + \Delta\tau)|x'(\tau), \lambda),$$
$$B_{ij}(x', \tau, \lambda) = \lim_{\Delta\tau \to 0} \frac{1}{\Delta\tau} \int Dx \, (x_i - x'_i)(x_j - x'_j)\, Q_T(x(\tau + \Delta\tau)|x'(\tau), \lambda).$$
At each layer the drift vector $b(x', \tau, \lambda)$ is the expected value of the change in internal representation and the diffusion matrix $B_{ij}(x', \tau, \lambda)$ is the expectation of the quadratic change; they are related to the expected values of the gradient and Hessian of the transfer function, respectively. As usual, take the time derivative of the expected value, with respect to $Q_T(x|x', \lambda)$, of a well-behaved test function $g(x)$. Taylor expand $g(x)$ around $x'$, integrate by parts, and use that $g(x)$ is arbitrary to obtain that $Q_T$ satisfies a parabolic PDE, and so does the evidence (see Equation (22)):
$$\frac{\partial \rho(x, \tau)}{\partial \tau} = -\sum_i \frac{\partial}{\partial x_i} \big[ b_i(x, \tau, \lambda)\, \rho(x, \tau) \big] + \frac{1}{2} \sum_{ij} \frac{\partial^2}{\partial x_i \partial x_j} \big[ B_{ij}(x, \tau, \lambda)\, \rho(x, \tau) \big]. \qquad (26)$$
The long time limit of Equation (26) is the predictive distribution $\rho(y, \tau = D) = P(y|x^0, \lambda)$.
Equation (26) is a generalization of an analogous diffusion equation which appears in Wilson's incomplete integration formulation of the renormalization group (e.g., [2]). It extends the type of transformation by permitting the transformations that lead from $\tau$ to $\tau + d\tau$ to be something other than a simple spatial average, which would eliminate high spatial frequency components. Instead, the transformations are mediated by the weights $\hat{w}$. It also differs from the usual statistical mechanics or field theories in the following sense. In those approaches, the transformation $\hat{w}$ is known and uniform and the aim is to obtain the final $\rho_D$, which describes the infrared limit or the thermodynamics of the theory. In supervised learning in neural networks, the starting point, defined by the input $X^0$, and the output $y$ are both given. The problem is to find the correct set of weights $\hat{w}$ that implements the correct input-output association. There are two regimes for the neural network. In the learning phase the set of examples is a set of microscopic-macroscopic variables that describe a task. The aim of learning is to determine the appropriate generalized RG transformation that maps from the microscopic description to the macroscopic. After learning, the network is used to find out, for the current RG transformation, the unknown macroscopic generalized thermodynamics or infrared properties associated with the microstate. The next step is to derive optimized learning algorithms from the solutions of Equation (26) and the EDNNA learning described by (7) and (8).
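A drift-diffusion equation of this kind can be integrated numerically once the drift and diffusion coefficients are known. The sketch below is a toy under strong assumptions that are not in the text: one unit, constant drift $b$ and constant diffusion $B$, and the standard Fokker-Planck form $\partial_\tau \rho = -\partial_x (b \rho) + \tfrac{1}{2} \partial_x^2 (B \rho)$, solved with an explicit finite-difference scheme on a periodic grid. In a network, $b$ and $B$ would depend on $x$, $\tau$ and the hyperparameters. The function names are hypothetical.

```python
import numpy as np

def solve_fokker_planck(b, B, rho0, dx, dtau, n_steps):
    """Explicit Euler steps with central differences (stable for small dtau)."""
    rho = rho0.copy()
    for _ in range(n_steps):
        right = np.roll(rho, -1)     # rho at x + dx (periodic wrap at the edges)
        left = np.roll(rho, 1)       # rho at x - dx
        drift_term = -b * (right - left) / (2.0 * dx)
        diffusion_term = 0.5 * B * (right - 2.0 * rho + left) / dx**2
        rho = rho + dtau * (drift_term + diffusion_term)
    return rho

def moments(x, rho, dx):
    """Total mass, mean and variance of a density sampled on the grid."""
    mass = rho.sum() * dx
    mean = (x * rho).sum() * dx / mass
    var = ((x - mean) ** 2 * rho).sum() * dx / mass
    return mass, mean, var

def gaussian_on_grid(x, dx, mean, var):
    """Normalized Gaussian initial condition for the evidence."""
    rho = np.exp(-0.5 * (x - mean) ** 2 / var)
    return rho / (rho.sum() * dx)
```

For constant coefficients a Gaussian stays Gaussian: the mean moves by $b\,\tau$ and the variance grows by $B\,\tau$, which gives a simple check of the solver before trying state-dependent coefficients.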