Stochastic Thermodynamics of Learning Parametric Probabilistic Models

We have formulated a family of machine learning problems as the time evolution of parametric probabilistic models (PPMs), which inherently renders learning a thermodynamic process. Our primary motivation is to leverage the rich toolbox of the thermodynamics of information to assess the information-theoretic content of learning a probabilistic model. We first introduce two information-theoretic metrics, memorized information (M-info) and learned information (L-info), which trace the flow of information during the learning process of PPMs. We then demonstrate that the accumulation of L-info during the learning process is associated with entropy production, and that the parameters serve as a heat reservoir in this process, capturing learned information in the form of M-info.


Introduction
Nearly half a century ago, physicists began to learn that information is a physical entity [1,2,3]. Today, the information-theoretic perspective has significantly impacted various fields of physics, including quantum computing [4], cosmology [5], and thermodynamics [6]. Simultaneously, recent years have witnessed the remarkable success of an algorithmic approach known as machine learning, which is adept at learning information from data. This paper is propelled by a straightforward proposition: if "information is physical", then the process of learning information must inherently be a physical process.
The concepts of memory, prediction, and information exchange between subsystems have undergone extensive exploration within the realms of Thermodynamics of Information [6] and Stochastic Thermodynamics [7]. For instance, Still et al. [8] delved into the thermodynamics of prediction, and the role of information exchange between thermodynamic subsystems has been studied by Sagawa and Ueda [9] and Esposito et al. [10]. This rich toolbox of the thermodynamics of information is our main avenue for studying the physics of the machine learning process, with the motivation to assess the information content of the learning process.
The type of machine learning problems we consider in this study encompasses any algorithmic approach that evolves a Parametric Probabilistic Model (PPM), or simply the model, towards a desirable target distribution through gradient-based loss function minimization. To establish our notation, consider a set of observations denoted by the training dataset B, drawn from an unknown target distribution p*. The PPM, without loss of generality, can be written as follows:

p(x|θ) = e^{-ϕ_θ(x)} / Z(θ),   Z(θ) = Σ_x e^{-ϕ_θ(x)}   (1)

This distribution is parameterized by a set of parameters θ ∈ R^M. The objective of learning is to find a set of parameters such that samples drawn from the model, x ∼ p(X|θ), exhibit desirable statistical characteristics. In machine learning practice, one constructs the function ϕ_θ(x) with a (deep) neural network and leaves the parameter selection task to an optimizer that minimizes a loss function. Examples encompass Energy-based models [11], Large Language Models, Softmax classifiers, Variational Autoencoders (VAEs) [12], among others.
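As a concrete illustration of this exponential form, the following minimal Python sketch builds a discrete PPM from an energy function; the quadratic energy is a hypothetical stand-in for the neural network ϕ_θ(x):

```python
import numpy as np

def make_ppm(phi):
    """Build a discrete PPM p(x|theta) = exp(-phi_theta(x)) / Z(theta).

    `phi` plays the role of the (neural-network) energy function; here it
    is a plain function of a 1-D state grid and a parameter vector theta.
    """
    def p(xs, theta):
        energies = phi(xs, theta)
        weights = np.exp(-energies)
        return weights / weights.sum()   # Z(theta) = sum_x exp(-phi_theta(x))
    return p

# Hypothetical quadratic energy: phi_theta(x) = theta[0] * (x - theta[1])**2
phi = lambda xs, th: th[0] * (xs - th[1]) ** 2
xs = np.linspace(-3, 3, 101)
p = make_ppm(phi)

probs = p(xs, np.array([1.0, 0.5]))
assert np.isclose(probs.sum(), 1.0)      # a normalized distribution over states
```

Choosing different θ moves the model through the family of distributions; the optimizer's job is to steer this motion toward p*.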
While the information-theoretic approach to this problem is prevalent in the field [13,14,15,16], it has also faced criticisms [17]. Our primary motivation for framing learning in a PPM as a thermodynamic process is to facilitate the assessment of the information content inherent in the learning process. The structure of this paper is outlined as follows: Section 2 briefly discusses prior information-theoretic approaches to the learning problem with PPMs, and the challenges they encounter. Subsequently, we introduce our own information-theoretic metrics. Finally, sections 3 and 4 employ the thermodynamic framework to address these information-theoretic inquiries.

Information content of PPMs
Locating information within the parametric model, a.k.a. the neural network, remains a fundamental question in machine learning [18]. This challenge is central to any information-theoretic perspective on machine learning problems.
In a pioneering study, Shwartz-Ziv et al. [19] quantified the internal information within neural networks by estimating the mutual information between inputs and the activities of hidden neurons. Moreover, they employed the information bottleneck theory to interpret the decrease in this mutual information as evidence of data compression during learning. This perspective garnered significant attention in the field [16,15], reinforcing the view of neural networks as an information channel. However, the study encountered critiques [20,17]. A primary problem was that the hidden neurons' activity in a neural network constitutes a deterministic function of the input. Such determinism inherently possesses a trivial mutual information value, even prior to any learning. The challenge of defining a well-defined and interpretable (Shannon) information metric in deterministic neural networks has prompted the proposition that neural network information processing is geometric in nature [21] (given that inputs are mapped to a latent space of differing dimensions), rather than information-theoretical.
In a distinct research direction, Ref. [22] addresses the significance of assessing the information content of the model's parameters. In our study, we echo this view, emphasizing that parameters are the primary carriers of learned information within neural networks. Consequently, any information-theoretic measure of learned information by the model should be grounded in parameters rather than the deterministic activity of hidden neurons. However, quantifying the information within parameters poses challenges, primarily due to the elusive nature of their distribution [23]. In this section, we introduce two information-theoretic metrics crafted to assess the information content within the learning process of a PPM. This paves the way for computing these quantities within the thermodynamic framework.
To avoid introducing new notation, we also denote B as the ground truth random variable associated with the target distribution p* from which the training dataset is sampled. Subsequently, we represent the action of the optimizer as a map between this ground truth random variable and the desired set of parameters after n optimization steps:

Θ_{t_n} = Λ_n(B)   (2)

The map Λ_n incorporates the structure of the loss function, the optimization algorithm, and any hyperparameters related to the optimizer's action. We exclude the initial parameters' value from this map's argument, under the assumption that as n increases, the final set of parameters becomes independent of its initial condition. In Information Theory terminology, this map corresponds to a statistic of the ground truth random variable [24]. Moreover, the outcome of this map defines a model, from which the final model-generated sample is drawn: x_{t_n} ∼ p(x|θ_{t_n}). Considering that the model-generated sample becomes independent of the ground truth random variable given the parameters, we can express the following Markov chain governing the learning process:

B → Θ_{t_n} → X_{t_n}

The Data Processing Inequality (DPI) associated with this Markov chain serves as our framework to define two information-theoretic metrics that gauge the information content of the model:

I_{Θ;B}(t_n) ≥ I_{X;B}(t_n)

We have used the notation presented in table 2.1. The left-hand side of this inequality quantifies the accumulation of mutual information between the parameters and the training dataset, while the right-hand side characterizes the performance of the generative model: it gauges the accumulation of mutual information between the model's generated samples and the training dataset. We refer to the former as Memorized Information (M-info) and the latter as Learned Information (L-info). We also note that both of these quantities start at zero before the training begins. Thus, their measurement at t_n reveals the accumulation of information during the learning process.
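The DPI above can be checked numerically on a toy chain B → Θ → X, where the optimizer map Λ is a deterministic statistic of B and X is sampled from the model given Θ; the distributions below are hypothetical sketch values, not the paper's setup:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint probability table."""
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    return entropy(pa) + entropy(pb) - entropy(joint.ravel())

# Markov chain B -> Theta -> X: Theta = Lambda(B) is a deterministic
# statistic, and X is drawn from a noisy model p(x|theta).
pB = np.full(4, 0.25)
Lambda = np.array([0, 0, 1, 1])              # theta = Lambda(b)
p_x_given_theta = np.array([[0.8, 0.2],      # rows: theta, cols: x
                            [0.3, 0.7]])

joint_BT = np.zeros((4, 2))
for b in range(4):
    joint_BT[b, Lambda[b]] = pB[b]
joint_BX = joint_BT @ p_x_given_theta        # p(b,x) = sum_t p(b,t) p(x|t)

M_info = mutual_info(joint_BT)               # I(B; Theta): memorized info
L_info = mutual_info(joint_BX)               # I(B; X): learned info
assert M_info >= L_info - 1e-12              # data processing inequality
assert np.isclose(M_info, 1.0)               # deterministic binary statistic:
                                             # I(B;Theta) = S_Theta = 1 bit
```

Note that because Θ is a deterministic function of B, M-info collapses to the Shannon entropy of Θ, anticipating the simplification derived below.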
Table 2.1: Notation. S_X(t): Shannon entropy of p_t(x). I_{X;Θ}(t): mutual information between X and Θ at time t.

The necessity of constraining the information in a model's parameters is highlighted in Ref. [22], echoing the Minimum Description Length Principle [25]. Additionally, studies suggest that the SGD optimizer tends to favor models with minimal information in their parameters [23]. Recent work by Ref. [26] has even proposed an upper limit for minimizing parameter information to bolster generalization capabilities. These findings suggest that the learning process seeks to minimize the left-hand side of the DPI inequality while simultaneously maximizing the right-hand side, which measures the model's performance. This leads us to an ideal scenario where I_{Θ;B}(t_n) = I_{X;B}(t_n), signifying that all memorized information is relevant to the learning task.
We now take one step further in our definition of M-info and L-info. First, the presence of the optimizer map of Eq. 2, connecting the ground truth source of the training dataset to the parameters, renders Θ_{t_n} a deterministic function of B, and allows us to simplify M-info as follows:

M-info: I_{B;Θ}(t_n) = S_Θ(t_n)

Thus, the parameters naturally emerge as the model's memory, where their Shannon entropy measures the information stored during the learning process.
Second, we swap B for Θ_{t_n} in the definition of L-info, at the cost of losing some information:

I_{X;B}(t_n) = I_{X;Θ}(t_n) − ϵ

where ϵ is a non-negative number that equals zero only when the outcome of the map Λ_n is a sufficient statistic for B. For the above expression, the condition of sufficiency can be relaxed to Θ being sufficient with respect to X. This means the map Λ_n preserves all information in B that is also shared with X. Indeed, in this problem we are interested in precisely this type of preservative map, whose action on the training dataset preserves task-related information. Therefore, we consider I_{X;Θ} a reasonable proxy for L-info, and we use the two interchangeably.

Figure 3.1 (caption): The learning trajectory T depicts the thermodynamic process that takes the initial model state to the final state. The green area shows the space of the family of distributions accessible to the PPM. The red area considers the possibility that the target distribution, p*, is not in this family.

The learning trajectory of a PPM
The time evolution of the PPM is the first clue to framing the learning process as a thermodynamic process. To illustrate this, consider a discretized time interval [0, t_n], which represents the time needed for n optimization steps of the parameters. During this time, the optimizer algorithm draws a sequence of i.i.d. samples from the training dataset. We denote this sequence by b_n := {b_{t_1}, b_{t_2}, . . ., b_{t_n}}, and refer to it as the "input trajectory". Then, the outcome of the optimization defines a sequence of parameters, called the "parameters' trajectory": θ_n := {θ_0, θ_{t_1}, θ_{t_2}, . . ., θ_{t_n}}. Each realization of the parameters defines a specific PPM. Consequently, the parameters' trajectory produces a sequence of PPMs:

T := {p(x|θ_0), p(x|θ_{t_1}), . . ., p(x|θ_{t_n})}   (8)

We refer to this sequence as the learning trajectory, depicted in figure 3.1. On the other hand, a thermodynamic process can be constructed solely from the time evolution of a distribution [27]. Therefore, we see T as a thermodynamic process. The physics of this process is encoded in the transition rates governing the master equation of this time evolution. Finding the transition rates associated with learning a PPM is our main task in this section.

The model subsystem
We refer to the subsystem that undergoes the thermodynamic process T as the model subsystem. This subsystem has X degrees of freedom, and its microscopic states' realizations along the learning trajectory represent model-generated samples at each time step: x_{t_i} ∼ p(X|θ_{t_i}). Furthermore, we denote the stochastic trajectory of model-generated samples by x_n := {x_{t_1}, x_{t_2}, . . ., x_{t_n}}. To avoid confusion with our notation, consider the probability functions p(x_{t_i}|θ_{t_i}) and p(x_{t_{i-1}}|θ_{t_i}), which respectively represent the probability of observing x_{t_i} ∈ x_n and x_{t_{i-1}} ∈ x_n at time t = t_i. Here, the time index of θ aligns with the time index of the PPM, i.e., p_{t_i}(X|θ_{t_i}) ≡ p(X|θ_{t_i}), because the PPM is fully defined upon observing the parameters. In contrast, the time index on x denotes a specific observation within x_n. To simplify our notation, the absence of a time index on x denotes a generic realization of the random variable X, and we write p(x|θ_{t_i}) instead of p(X|θ_{t_i}).

The parameters subsystem
The parameters of the neural network at each step of optimization represent a realization of the parameters subsystem, with Θ degrees of freedom and the stochastic trajectory θ_n. The statistical state of the parameters subsystem is given by the marginal p(θ_{t_i}) at time step t = t_i. This marginal state represents the statistics of all possible outcomes of training a PPM on a specific learning objective. We can think of training an ensemble of computers on the same machine learning task. This allows us to think about the time evolution of the marginal p(θ_{t_i}), and the joint distribution p(x|θ_{t_i})p(θ_{t_i}), during the learning process. We refer to this view as the ensemble view of the learning process. In practice, however, we train the PPM only once, and we do not have access to the marginal p(θ_{t_i}). Thus, our model-generated samples are conditioned on specific observations of the parameters, θ_{t_i} ∼ Θ_{t_i}. This defines the conditional view of the learning process, which is fully described by the learning trajectory of the PPM.
In machine learning practice, it is desirable for a training process to exhibit a robust outcome, regardless of who is running the code. One way to achieve this is by imposing a low-variance condition on the statistics of the parameters across the ensemble of all learning trials. This condition asserts that the parameters' trajectory across the ensemble is confined to a small region D_n ⊂ R^M. As n grows larger, this region shrinks and becomes associated with the area surrounding the target distribution, as depicted in Figure 3.1. Under this condition, ensemble averages over p(θ_n) can be replaced by their value at a typical trajectory θ_n ∈ D_n:

⟨f⟩_{p(θ_n)} ≈ f(θ_n)   (9)

The above approximation becomes exact when p(θ_n) assumes the form of a delta-Dirac function, indicating a zero-variance condition in the parameters' dynamics.
The low-variance condition proves invaluable when computing the information-theoretic measurements introduced in section 2. This is because the computation of the M-info I_{B;Θ} and the L-info I_{X;Θ} necessitates averaging over the parameters' distribution. However, since we typically train our model just once, we lack direct access to the parameters' distribution throughout the learning trajectory. To overcome this challenge, we introduce the Conditional L-info, the mutual information accumulated along a single observed parameters' trajectory:

I_{X;Θ}(t_n | θ_n)   (10)

Subsequently, under the low-variance condition of the parameters subsystem, we can measure the conditional L-info as a proxy for the L-info: I_{X;Θ}(t_n) ≈ I_{X;Θ}(t_n | θ_n). In section 4, we will delve deeper into the evidence supporting the low-variance condition of the subsystem Θ.
We refer to the joint (X, Θ) as the learning system, which embodies the thermodynamic process of learning a PPM. In this section, we will demonstrate that the thermodynamic exchange between the model subsystem and the parameters subsystem is the primary source producing M-info and L-info during the learning process. Before delving further, we establish two interconnected assumptions about the parameters subsystem: (1) The PPM is over-parameterized; specifically, the subsystem Θ has a much higher dimension compared to the subsystem X.
(2) The parameters subsystem evolves in a quasi-static fashion (slow dynamics).
The foundation for these assumptions in machine learning is clear. Training over-parameterized models represents a pivotal achievement of machine learning algorithms, and the slow dynamics (often termed lazy dynamics) of these over-parameterized models are well-documented [28,29]. These characteristics underscore the significant role of the parameters subsystem in the learning process, akin to that of a heat reservoir. Over-parameterization implies a higher heat capacity for this subsystem compared to the model subsystem. Additionally, the quasi-static dynamics align with the behavior of an ideal heat reservoir, which does not contribute to entropy production [30]. The role of the parameters subsystem as a reservoir aligns with the assumption of a low-variance condition for this subsystem. This is because we expect the stochastic dynamics of a reservoir in contact with the subsystem to be low-variance across the ensemble of all trials.
In this study, we attribute the role of an ideal heat reservoir to the parameters subsystem, with temperature β^{-1} = 1. In section 4, we delve deeper into the rationale behind this assumption by examining the stochastic dynamics of the parameters under a vanilla stochastic gradient descent optimizer, and highlighting potential limitations of this assumption.

Lagged bipartite dynamics
We want to emphasize that the dynamics of subsystem X is not a mere conjecture or an arbitrary component in this study; rather, it is an integral part of training a generative PPM. This dynamics is inherent in the optimizer's action, which necessitates a fresh set of model-generated samples to compute the loss function or its gradients after each parameter update. For instance, in the context of EBMs, a Langevin Monte Carlo (LMC) sampler can be employed to generate new samples from the model [31]. The computational cost of producing a fresh set of model-generated samples introduces a time delay in the parameter dynamics. For instance, when using an LMC sampler, the number of Monte Carlo steps dictates this lag time. Conversely, in the case of a language model, since the computation of the loss function relies on inferring subsequent tokens, the inference latency signifies the time delay. We denote the lag-time parameter by τ. Here, the model subsystem evolves on the timescale δt, while the parameters subsystem evolves on the timescale α = τ δt. In the thermodynamic context, this parameter represents the relaxation time of the subsystem X under a fixed microscopic state of subsystem Θ. Conceptually, the parameter τ acts as a complexity metric, quantifying the computational resources required for each parameter optimization step. Moreover, the dynamics of the joint (X, Θ) exhibit a bipartite property. This implies that simultaneous transitions in the states of X and Θ are not allowed, given that the observation of a new set of model-generated samples occurs only after a parameter update.
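A minimal sketch of the LMC relaxation step, with τ as the number of Monte Carlo steps between parameter updates; the Gaussian energy ϕ_θ(x) = (x − θ)²/2 is an assumed toy model, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def lmc_relax(x0, grad_phi, theta, tau, step=0.01):
    """Relax the model subsystem for tau Langevin Monte Carlo steps
    under a fixed parameter value theta (the lag time between updates)."""
    x = x0
    for _ in range(tau):
        x = x - step * grad_phi(x, theta) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

# Hypothetical Gaussian EBM: phi_theta(x) = (x - theta)^2 / 2, grad = x - theta,
# whose stationary model distribution is N(theta, 1).
grad_phi = lambda x, th: x - th
x = lmc_relax(np.zeros(5000), grad_phi, theta=2.0, tau=1000)

# With tau >> 1 the samples equilibrate near the model distribution
assert abs(x.mean() - 2.0) < 0.1
assert abs(x.std() - 1.0) < 0.1
```

Shrinking `tau` leaves the samples out of equilibrium, which is exactly the source of the conditional entropy production discussed later.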
The lagged bipartite dynamics described above can be represented using two time resolutions: δt and α. In the finer time resolution of δt, the Markov chain within the time interval [t_i, t_{i+1}] is as follows:

(x_{t_i}, θ_{t_i}) → (x_{t_i}, θ_{t_{i+1}}) → (x', θ_{t_{i+1}}) → · · · → (x_{t_{i+1}}, θ_{t_{i+1}})   (11)

where the parameters update first, and the subsystem X then relaxes through τ intermediate states under the fixed θ_{t_{i+1}}. We can also analyze this dynamics at a coarser time resolution of α. Within the interval [t_0, t_n], the Markov chain appears as:

(x_0, θ_0) ⇢ (x_{t_1}, θ_{t_1}) ⇢ · · · ⇢ (x_{t_n}, θ_{t_n})   (12)

In the above Markov chain, the dashed arrows remind us of our ignorance of the intermediate steps in the high-resolution picture 11. Figure 3.2 illustrates the lagged bipartite dynamics of the learning system. An important observation is that the learning trajectory T, as defined in 8, is written in the low-resolution picture. Therefore, studying the learning trajectory means studying the dynamics of the system (X, Θ) in the low-resolution picture.

Trajectory probabilities
To set the stage for the application of the Fluctuation Theorem (FT) to learning a PPM, we define the trajectory probability of the joint (x_n, θ_n) as the probability of observing a series of model-generated samples and parameters during the learning process:

P[x_n, θ_n] := p(x_0, x_{t_1}, . . ., x_{t_n}, θ_0, θ_{t_1}, . . ., θ_{t_n})

Additionally, we can consider the time reversal of the samples' trajectory and the parameters' trajectory, respectively, as x̃_n := {x_{t_n}, x_{t_{n-1}}, . . ., x_{t_1}} and θ̃_n := {θ_{t_n}, θ_{t_{n-1}}, . . ., θ_{t_1}}. Then, the probability of observing the backward trajectory is denoted by P[x̃_n, θ̃_n].
Here, P[x_n, θ_n] represents the trajectory probability of the learning system in the ensemble view. In practice, however, we typically train our model only once, and we often lack access to the parameters' distribution. Therefore, our model is conditioned on the observation of a specific parameters' trajectory θ_n. This defines the trajectory probability in the conditional view:

P[x_n | θ_n] := P[x_n, θ_n] / P[θ_n]

where P[θ_n] = p(θ_0, θ_{t_1}, . . ., θ_{t_n}).
Similarly, the backward conditional trajectory probability is the probability of observing the time-reversed samples' trajectory, conditioned on the observation of the time-reversed parameters' trajectory: P[x̃_n | θ̃_n].
We now use the Markov property of the Markov chains 11 and 12, respectively, to decompose the conditional trajectory probability and the marginal trajectory probability as follows:

P[x_n | θ_n] = p(x_0|θ_0) ∏_{i=1}^{n} p(x_{t_i} | x_{t_{i-1}}, θ_{t_i})   (16)

P[θ_n] = p(θ_0) ∏_{i=1}^{n} p(θ_{t_i} | θ_{t_{i-1}})   (17)

where expressions such as p(x_{t_n}|x_{t_{n-1}}, θ_{t_n}) and p(θ_{t_n}|θ_{t_{n-1}}) represent the transition probabilities that determine the probability of moving from one microscopic state to another. Additionally, we define two trajectory probabilities, conditioned on the initial conditions, which will be used later in the formulation of the FT:

P_F[x_n | θ_n] := P[x_n | θ_n] / p(x_0|θ_0),   P_B[x̃_n | θ̃_n] := P[x̃_n | θ̃_n] / p(x_{t_n}|θ_{t_n})   (21)
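The decomposition of the conditional trajectory probability can be sketched for a discrete toy model, where (per the Markov property discussed in the next subsection) each transition reduces to the PPM itself; the two-state model and parameter values are hypothetical:

```python
import numpy as np

def conditional_traj_prob(x_traj, theta_traj, p_x_given_theta, p0):
    """P[x_n | theta_n] for the bipartite chain: under the Markov property,
    the transition p(x_ti | x_{ti-1}, theta_ti) is the PPM itself,
    p(x_ti | theta_ti), so the product telescopes over the trajectory."""
    prob = p0[x_traj[0]]
    for x, th in zip(x_traj[1:], theta_traj[1:]):
        prob *= p_x_given_theta[th][x]
    return prob

# Toy two-state model with two parameter values (all numbers hypothetical)
p_x_given_theta = {0: np.array([0.9, 0.1]), 1: np.array([0.2, 0.8])}
p0 = np.array([0.5, 0.5])

x_traj = [0, 0, 1]
theta_traj = [0, 0, 1]
P = conditional_traj_prob(x_traj, theta_traj, p_x_given_theta, p0)
assert np.isclose(P, 0.5 * 0.9 * 0.8)   # p(x_0) * p(x_t1|th_t1) * p(x_t2|th_t2)
```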

Local Detailed Balance (LDB) for learning PPMs
The transition probabilities represented in 16 capture the physics of the learning problem. Considering a Markov property (i.e., a memoryless process) for the time evolution of the model subsystem, the transition rate for this subsystem reduces to the PPM itself:

p(x_{t_i} | x_{t_{i-1}}, θ_{t_i}) = p(x_{t_i} | θ_{t_i})

The above expression states that the transition rate between two microscopic states x_{t_{i-1}} and x_{t_i} under the fixed θ_{t_i} is equivalent to the probability of observing x_{t_i} under the PPM itself at t = t_i. To reiterate, this is the Markov property that asserts the elements of x_n are independently and freshly drawn from the PPM specified by the given parameters along the learning trajectory T. This is especially true when τ ≫ 1. We can generalize this observation to the backward transition probability p(x_{t_{i-1}} | x_{t_i}, θ_{t_i}), which represents the probability of the backward transition (x_{t_i}, θ_{t_i}) → (x_{t_{i-1}}, θ_{t_i}) under the fixed θ_{t_i}, as follows:

p(x_{t_{i-1}} | x_{t_i}, θ_{t_i}) = p(x_{t_{i-1}} | θ_{t_i})

The above expression tells us that the probability of the backward transition is equivalent to the probability of observing the sample generated at t = t_{i-1} in x_n under the PPM at time t = t_i.
Finally, we write the log ratio of the forward and backward transitions:

ln [ p(x_{t_i} | x_{t_{i-1}}, θ_{t_i}) / p(x_{t_{i-1}} | x_{t_i}, θ_{t_i}) ] = ln [ p(x_{t_i} | θ_{t_i}) / p(x_{t_{i-1}} | θ_{t_i}) ] = ϕ_{θ_{t_i}}(x_{t_{i-1}}) − ϕ_{θ_{t_i}}(x_{t_i})

where the second equality is due to Eq. 1. The above expression resembles the celebrated Local Detailed Balance (LDB) [32], which relates the log ratio of the forward and backward transition probabilities to the difference in potential energy between the initial and final states of the transition. The heat reservoir that supports the legitimacy of the above LDB expression for learning PPMs is the parameters subsystem, whose temperature has been set to one, as we will discuss in more detail in section 4. We emphasize that the above LDB has emerged naturally under the assumption of the Markov property and a relaxation time for learning a generic generative PPM. It is also important to note that the above LDB is only valid in the low-resolution picture. This is significant because it renders the application of the FT framework to learning PPMs practical, as we have access to the elements of the learning trajectory.
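Because the partition function Z(θ) cancels in the ratio, the LDB relation can be verified directly on a toy discrete EBM; the quadratic energy and parameter value are assumed examples:

```python
import numpy as np

# Toy EBM on a discrete grid: p(x|theta) = exp(-phi_theta(x)) / Z(theta)
xs = np.arange(5)
phi = lambda x, th: th * (x - 2.0) ** 2          # hypothetical energy
theta = 0.7

energies = phi(xs, theta)
p = np.exp(-energies) / np.exp(-energies).sum()

# LDB: ln[ p(x_i|theta) / p(x_{i-1}|theta) ] = phi_theta(x_{i-1}) - phi_theta(x_i)
x_prev, x_next = 0, 2
lhs = np.log(p[x_next] / p[x_prev])
rhs = phi(x_prev, theta) - phi(x_next, theta)
assert np.isclose(lhs, rhs)                      # Z(theta) cancels in the ratio
```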

L-info from fluctuation theorem
The version of the fluctuation theorem we are about to apply to learning PPMs is known as the Detailed Fluctuation Theorem (DFT) [33]. We also note that the machinery we are about to present for measuring information flow in PPMs was developed to study information exchange between thermodynamic subsystems [9]. The novelty here lies merely in the application of this machinery to the learning process of a PPM. In this section, we extensively use the notation presented in table 2.1. Also, note that the temperature of the parametric reservoir is set to one. Applying the DFT in the conditional view, i.e., to the conditional forward and backward trajectories defined in Eq. 21, results in:

σ[x_n | θ_n] := ln [ p(x_0|θ_0) P_F[x_n | θ_n] / ( p(x_{t_n}|θ_{t_n}) P_B[x̃_n | θ̃_n] ) ]
             = Δ_{t_n} s[p(x|θ)] + Σ_{i=1}^{n} ln [ p(x_{t_i}|θ_{t_i}) / p(x_{t_{i-1}}|θ_{t_i}) ]
             = Δ_{t_n} s[p(x|θ)] − q_{x_n}(θ_n)   (22)

The first line is due to the DFT, which defines the stochastic EP as the logarithm of the ratio of the forward and backward trajectory probabilities. The second line is due to the decomposition presented in Eq. 17. Finally, the third line is the consequence of the LDB relation and the definition of the stochastic heat flow q_{x_n}(θ_n), as the change in the energy of the subsystem X due to alterations in its microscopic state configuration:

q_{x_n}(θ_n) := Σ_{i=1}^{n} [ ϕ_{θ_{t_i}}(x_{t_i}) − ϕ_{θ_{t_i}}(x_{t_{i-1}}) ]   (23)

Note that our sign convention defines q_{x_n} > 0 as heat absorbed by the subsystem X.
The second law arises from averaging Eq. 22 over the forward trajectory distribution P_F[x_n|θ_n], and recalling the non-negativity of the KL-divergence to establish the non-negativity of the averaged EP:

Σ_{X|Θ}(t_n) := ⟨ σ[x_n | θ_n] ⟩_{P_F[x_n|θ_n]} ≥ 0

We note that the averaged EP is still conditioned on the stochastic trajectory of the parameters; thus, we refer to it as the conditional EP. This is indeed a consequence of working in the conditional view.
Motivated to compute the L-info, in the next step we rearrange Eq. 22 as follows:

σ[x_n | θ_n] = Δ_{t_n} s[p(x)] − q_{x_n}(θ_n) − I[x_{t_n} : θ_{t_n}]   (24)

where I[x_{t_n} : θ_{t_n}] := s[p(x_{t_n})] − s[p(x_{t_n}|θ_{t_n})] is the mutual content (or stochastic mutual information) at t = t_n, and the initial mutual content is taken to be zero. We now arrive at the conditional L-info 10 by averaging Eq. 24 over P_F[x_n|θ_n]:

I_{X;Θ}(t_n | θ_n) = Σ_X(t_n) − Σ_{X|Θ}(t_n)   (25)

which defines the Partially Averaged (PA) quantities, such as the marginal EP Σ_X(t_n) := ⟨ Δ_{t_n} s[p(x)] − q_{x_n}(θ_n) ⟩_{P_F[x_n|θ_n]}. We note that all PA quantities are conditioned on the parameters' trajectory, i.e., on the choice of θ_n from the ensemble. This is a direct consequence of working in the conditional view. However, this also signifies that all the thermodynamic quantities mentioned above are computable in the practice of machine learning, as they only require access to the time evolution of one PPM. Fortunately, thanks to the low-variance condition 9, we can use the conditional L-info as a proxy for the L-info, given that I_{X;Θ}(t_n) ≈ I_{X;Θ}(t_n | θ_n). Eq. 25 equates the (conditional) L-info to the difference between the marginal EP and the conditional EP. We refer to this difference as the ignorance EP:

Σ_I(t_n) := Σ_X(t_n) − Σ_{X|Θ}(t_n)

It is important to note that both the marginal EP and the conditional EP measure the EP of the same process, which is the time evolution of the subsystem X. However, the conditional EP measures this quantity at the lower time resolution of α, conditioned on a specific parameters' trajectory. On the other hand, the marginal EP measures this quantity at the higher time resolution of δt, including the relaxation time of the subsystem X between each parameters' update. Therefore, the term "ignorance" refers to ignorance of the full dynamics of X, and the origin of the L-info is the EP between each pair of consecutive parameter updates, i.e., the EP of generating fresh samples, represented by the Markov chain 11.
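The identification of the averaged mutual content with the mutual information I_{X;Θ} can be verified numerically for a toy joint state of (X, Θ); the probability values are hypothetical:

```python
import numpy as np

# Joint state of (X, Theta) at t_n (hypothetical numbers)
p_theta = np.array([0.5, 0.5])
p_x_given_theta = np.array([[0.9, 0.1],
                            [0.2, 0.8]])
joint = p_theta[:, None] * p_x_given_theta       # p(theta, x)
p_x = joint.sum(axis=0)

# Stochastic mutual content: i[x:theta] = s[p(x)] - s[p(x|theta)]
#                                       = ln p(x|theta) - ln p(x)
i_stoch = np.log(p_x_given_theta) - np.log(p_x)[None, :]

# Its average over the joint distribution is the mutual information I_{X;Theta}
L_info = (joint * i_stoch).sum()
I_check = sum(joint[t, x] * np.log(joint[t, x] / (p_theta[t] * p_x[x]))
              for t in range(2) for x in range(2))
assert np.isclose(L_info, I_check)
assert L_info >= 0
```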

M-info and the role of the parameters subsystem
We can also apply the DFT to subsystem Θ:

σ[θ_n] := ln [ p(θ_0) P_F[θ_n] / ( p(θ_{t_n}) P_B[θ̃_n] ) ] = Δ_{t_n} s[p(θ_t)] − q_{θ_n}

In the above expression, the second equality is due to the decomposition in Eq. 17, and the definition of the stochastic heat flow for the parameters subsystem:

q_{θ_n} := Σ_{i=1}^{n} ln [ p(θ_{t_{i-1}}|θ_{t_i}) / p(θ_{t_i}|θ_{t_{i-1}}) ]

Under the assumption that the subsystem Θ evolves quasi-statically, the EP of this subsystem is zero, as expected for an ideal reservoir. This results in q_{θ_n} = Δ_{t_n} s[p(θ_t)]. Furthermore, in the closed system (X, Θ), the heat flow of the subsystem X must be compensated by an inverse flow in the subsystem Θ, i.e., q_{x_n}(θ_n) = −q_{θ_n}. Thus, we arrive at the stochastic version of Clausius' relation for the heat reservoir:

−q_{x_n}(θ_n) = Δ_{t_n} s[p(θ_t)]   (28)

This relation states that heat dissipation in subsystem X (q_{x_n}(θ_n) < 0) is compensated by an increase of information in subsystem Θ. We recall the definition of M-info as the entropy of subsystem Θ. Since heat dissipation is a source of L-info accumulation (see Eq. 25), the above Clausius relation states that this information is stored in the parameters by increasing the entropy of this subsystem, a.k.a. the M-info, confirming the role of the parameters as the memory space of the PPM.
We can also take the ensemble average of Eq. 28 (i.e., averaging over P[x_n, θ_n]):

−Q_X = Δ_{t_n} S_Θ

where Q_X := ⟨ q_{x_n}(θ_n) ⟩_{P[x_n, θ_n]} is the fully averaged dissipated heat from the subsystem X. However, under the low-variance condition of learning, we expect the partially averaged heat Q_X(θ_n) to be independent of the choice of the parameters' trajectory from the ensemble of computers. Thus, we can write:

−Q_X(θ_n) ≈ Δ_{t_n} S_Θ

The ideal learning process
The learning objective necessitates an increase in L-info to enhance the model's performance, while simultaneously reducing M-info to minimize generalization error and prevent overfitting. As previously mentioned in Section 2, the ideal scenario is achieved when all the stored information in the parameters (M-info) matches the task-relevant information learned by the model (L-info). Now that we have studied the machinery for computing these two information-theoretic quantities through the computation of entropy production, we can formally examine this optimal learning condition.
Maximizing L-info, as described in Eq. 25, is equivalent to maximizing the marginal EP while minimizing the conditional EP. Given that the conditional EP is always non-negative, the "ideal" scenario would involve achieving a conditional EP of zero, i.e., Σ_{X|Θ}(t_n) = 0. This condition can be realized through a quasi-static time evolution of the PPM occurring on the lower-resolution timescale α, presented in the Markov chain 12. In the context of generative models, this condition is akin to achieving perfect sampling. Under these circumstances, all the EP of the subsystem X transforms into L-info, resulting in:

I_{X;Θ}(t_n) = Σ_X(t_n)

Thermodynamically, the condition of quasi-static time evolution of the PPM (and consequently zero conditional EP) can be realized by having a large relaxation parameter τ ≫ 1, which allows the model to reach equilibrium after each optimization step. However, a high relaxation parameter comes at the cost of requiring more computational resources and longer computation times. This introduces a fundamental trade-off between the time required to run a learning process and its efficiency, a concept central to thermodynamics and reminiscent of the Carnot cycle, which represents an ideal engine requiring infinite operation time.

The parameters' reservoir
In the formulation of the previous section, we made the assumption that the subsystem Θ behaves as an ideal reservoir. In this section, we examine the premises of this assumption more closely by studying the dynamics of the parameters subsystem.
To facilitate our formulation, we adopt the negative log-likelihood as a fairly general form for the loss function:

ℓ(θ_t, b_t) := −(1/|b_t|) Σ_{x ∈ b_t} ln p(x|θ_t) = (1/|b_t|) Σ_{x ∈ b_t} ϕ_{θ_t}(x) + ln Z(θ_t)

Here, the loss function is computed according to the empirical average over a random mini-batch b_t ∈ B drawn from the training dataset at time step t. The last equality is due to the PPM defined in Eq. 1, and the notation |·| denotes the size of a set. We also use a vanilla Stochastic Gradient Descent (SGD) optimizer, with learning rate r, to take gradient steps iteratively for n steps in the direction of loss function minimization:

θ_{t_{n+1}} = θ_{t_n} − r ∇_θ ℓ(θ_{t_n}, b_{t_n})   (31)

To render the dynamics of the parameters in the form of a conventional overdamped Langevin dynamics, we introduce the following conservative potential, defined by the entire training dataset B:

U_B(θ) := ℓ(θ, B)

The negative gradient of this potential gives rise to a deterministic force. Additionally, we define the fluctuation term, which represents the source of random forces due to the selection of a mini-batch at time step t_n:

η(t_n) := ∇_θ U_B(θ_{t_n}) − ∇_θ ℓ(θ_{t_n}, b_{t_n})

We now reformulate the SGD optimizer 31 in the guise of an overdamped Langevin dynamics, dividing it by the parameters' update timescale α to convert the learning protocol into a dynamics over time:

(θ_{t_{n+1}} − θ_{t_n}) / α = −µ ∇_θ U_B(θ_{t_n}) + µ η(t_n)   (33)

where µ := r/α is known as the mobility constant in the context of Brownian motion.
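The decomposition of SGD into a deterministic drift −r∇_θU_B and a fluctuation term rη is an exact rearrangement, which the following sketch demonstrates on a hypothetical quadratic per-sample loss (not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic per-sample loss: l(theta, x) = (theta - x)^2 / 2,
# whose gradient is (theta - x); the minimizer of U_B is the mean of B.
grad_sample = lambda theta, x: theta - x
B = rng.normal(size=1000)                        # full training set

def sgd_step(theta, batch, r):
    """Vanilla SGD (Eq. 31) decomposed as drift + noise (Eq. 33):
    theta' = theta - r * grad U_B(theta) + r * eta(t)."""
    g_batch = grad_sample(theta, batch).mean()   # mini-batch gradient
    g_full = grad_sample(theta, B).mean()        # grad of the potential U_B
    eta = g_full - g_batch                       # fluctuation term
    assert np.isclose(-r * g_batch, -r * g_full + r * eta)  # exact rearrangement
    return theta - r * g_batch

theta = 5.0
for _ in range(200):
    batch = rng.choice(B, size=32)
    theta = sgd_step(theta, batch, r=0.1)
assert abs(theta - B.mean()) < 0.3               # settles near the minimizer of U_B
```

The fluctuation term vanishes when the mini-batch is the whole dataset, recovering deterministic gradient descent on U_B.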
We note that Eq. 33 is merely a rearrangement of the standard SGD. For us to interpret it as a Langevin equation, the term η(t_n) must represent a stationary stochastic process, so that it can serve as the noise term in the Langevin equation. To demonstrate this property of η(t_n), we must examine the characteristics of its Time Correlation Function (TCF) [34]:

C_{i,j}(t, t′) := ⟨ η_i(t) η_j(t′) ⟩

where the indices i, j represent different components of the vector θ, and δ_{i,j} denotes the Kronecker delta.
If the fluctuation term η satisfies the condition of white noise (an uncorrelated stationary random process), and assuming that Eq. 33 describes a motion akin to Brownian motion, we can apply the fluctuation-dissipation theorem to write:

C_{i,j}(t, t′) = ⟨ η_i(t) η_j(t′) ⟩ = (2 k_B T / µ) δ_{i,j} δ(t − t′)   (34)

Here, δ(t − t′) is a Dirac delta, and the constant T symbolizes the temperature. The constant k_B stands for the Boltzmann constant. To render our framework unitless, we treat the product of the Boltzmann constant and temperature as dimensionless. Moreover, regardless of the noise width, we set T = 1, and henceforth it will not appear in our formulation. This is possible by adjusting the Boltzmann constant according to the noise width, i.e., k_B = µ⟨η_i(t)η_i(t)⟩/2.

Figure 4.1 (caption, panels c and d): (c) The TCF C_{i,i}(t, 0) for each mini-batch size scenario, underscoring the stationary nature of η; this panel also highlights the role of mini-batch size in determining the noise width, i.e., the temperature of the environment. The horizontal dashed line indicates the maximum absolute value observed from ∇_θ U_B(θ_{t_n}), serving as a reference point for the magnitude of the noise. (d) The autocorrelation of the term η averaged over all parameters; for instance, computing this quantity at step 1000 reads C_{i,i}(t = 1000, t′ − t). The rapid decline in autocorrelation with time lag indicates the white-noise characteristic of η.
We still need to investigate whether the fluctuation term indeed describes an uncorrelated stationary random process, as presented in Eq. 34. To this end, we conducted an experiment by training an ensemble of 50 models for the classification of the MNIST dataset. To induce different levels of stochastic behavior, i.e., different "temperatures", we consider three different mini-batch sizes. A smaller mini-batch size leads to a larger deviation in the fluctuation term, consequently amplifying the influence of random forces. Results are presented in Fig. 4.1. Panel 4.1c represents the TCF at zero time lag, t = t′, i.e., the variance of η(t), as a function of time. The constant value of the variance suggests the stationary property of η(t). Moreover, Fig. 4.1d illustrates the autocorrelation of η(t) at different time lags, indicating the white-noise characteristic of this term.
However, it would be naive to draw a generic conclusion regarding the nature of the fluctuation term as an uncorrelated stationary random process solely from one simple experiment. Indeed, research has demonstrated that the noise term can be influenced by the Hessian matrix of the loss function [35]. This observation aligns with our definition of the fluctuation term in Eq. 33, where η is defined in relation to the gradient of the loss itself. Consequently, as the optimizer explores the landscape of the loss function, the characteristics of the fluctuation term η can vary. We can grasp this concept in the context of Brownian motion by envisioning a Brownian particle transitioning from one medium to another, each with distinct characteristics. This implies that there could be intervals during training where η stays independent of the loss function and exhibits stationary behavior. Moreover, we overlooked the fact that η(t) is also a function of θ itself, which could potentially jeopardize its stationarity. To address this issue, we refer to the slow (lazy) dynamics [28,29] of over-parameterized models under SGD optimization. This slow dynamics allows us to write the first-order Taylor expansion of the loss function around a microscopic state θ*, sampled from its current state p_t(θ):

$$\phi_{\theta_t}(b_t) \approx \phi_{\theta^*}(b_t) + (\theta_t - \theta^*)^\top\, \nabla_\theta\, \phi_{\theta^*}(b_t). \qquad (35)$$

As a result, the gradient of the loss satisfies $\nabla_\theta \phi_{\theta_t}(b_t) = \nabla_\theta \phi_{\theta^*}(b_t)$, signifying behavior independent of the specific value of the parameters θ_t at a given time t. We can extend this argument to the deterministic force $-\nabla_\theta U_B(\theta_t)$, which acts as a conservative force in the lazy-dynamics regime, denoted F(θ*). The key point here is that the value of this force does not depend on the microscopic state θ_t, but rather on any typical sample, θ*, from Θ_t. In Appendix 5, we illustrate how the condition of lazy dynamics leads to a thermodynamically reversible dynamics of the subsystem Θ.
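The lazy-dynamics approximation can be probed numerically: for a nearby sample θ* the mini-batch gradient should be nearly unchanged. A hypothetical least-squares sketch (all quantities here are stand-ins, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in least-squares problem for the loss phi_theta(b_t).
X = rng.normal(size=(500, 10))
y = rng.normal(size=500)

def grad_phi(theta, Xb, yb):
    """Gradient of the loss (least-squares stand-in)."""
    return Xb.T @ (Xb @ theta - yb) / len(yb)

theta_t = rng.normal(size=10)
# A nearby microscopic state theta*, sampled close to theta_t,
# mimicking the narrow parameter distribution of lazy dynamics:
theta_star = theta_t + 1e-3 * rng.normal(size=10)

g_t = grad_phi(theta_t, X, y)
g_star = grad_phi(theta_star, X, y)

# Relative deviation between the gradients at the two states:
rel_dev = np.linalg.norm(g_t - g_star) / np.linalg.norm(g_t)
assert rel_dev < 1e-2   # gradients agree to first order in the lazy regime
```

When the parameter distribution is narrow, the gradient evaluated at any typical sample represents the gradient of the whole ensemble, which is the content of Eq. 35.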

Naive parametric reservoir
The stationary state of subsystem Θ under the dynamics of Eq. 33, satisfying the fluctuation-dissipation relation in Eq. 34, corresponds to the thermal equilibrium (canonical) state:

$$p^{\mathrm{eq}}(\theta) = e^{F_\Theta - U_B(\theta)}, \qquad (36)$$

where $F_\Theta := -\log\big( \int d\theta\, e^{-U_B(\theta)} \big)$ is the free energy of the subsystem Θ. Recall that the temperature has been set to one. This state also satisfies the detailed balance condition, which fixes the log ratio between the forward and backward transition probabilities as follows:

$$\log \frac{W(\theta \to \theta')}{W(\theta' \to \theta)} = U_B(\theta) - U_B(\theta'). \qquad (37)$$

The standard plot of the loss function versus optimization steps in machine learning practice can help us visualize the dynamics of the subsystem Θ. A rapid decline in the loss function signals a swift relaxation of the subsystem Θ to its equilibrium state. It is important to note that this self-equilibrating property is determined by the training dataset B through the definition of the potential function U_B(θ). These swift, self-equilibrating properties mirror the characteristics of a heat reservoir in thermodynamics [30]. Hence, we refer to the subsystem Θ as the parametric reservoir. After the swift decline, a gradual reduction of the loss function can be a sign of a quasi-static process, in which the subsystem Θ evolves from one equilibrium state to another. This can be due to the lazy-dynamics condition, as discussed in Appendix 5. Additionally, the requirement of a high heat capacity for the reservoir, represented as dim(Θ) ≫ dim(X), offers a thermodynamic justification for the use of over-parameterized models in machine learning.
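The detailed balance relation can be checked for a discretized version of Eq. 33. The sketch below uses a hypothetical 1D quadratic potential and the Gaussian transition density of an Euler-Maruyama step, verifying that the log ratio of forward and backward transition probabilities matches the energy difference up to O(dt) corrections:

```python
import numpy as np

# Hypothetical 1D check: quadratic potential U_B(theta) = theta**2 / 2,
# k_B * T = 1, Euler-Maruyama discretization of the Langevin equation.
mu, dt = 1.0, 1e-3

def U(th):
    return 0.5 * th ** 2

def dU(th):
    return th

def log_W(a, b):
    """Log transition density of one Langevin step from a to b.

    One step: b = a - mu*dt*U'(a) + Gaussian noise of variance 2*mu*dt,
    so the density is Gaussian around the drifted mean (constant dropped,
    it cancels in the forward/backward ratio).
    """
    mean = a - mu * dt * dU(a)
    return -(b - mean) ** 2 / (4 * mu * dt)

a, b = 0.7, 0.4
lhs = log_W(a, b) - log_W(b, a)   # log forward/backward ratio
rhs = U(a) - U(b)                 # energy difference, cf. Eq. 37
assert abs(lhs - rhs) < 1e-2      # agreement up to O(dt) corrections
```

Shrinking dt tightens the agreement, reflecting that detailed balance holds exactly only in the continuous-time limit of the discretized dynamics.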

Realistic parametric reservoir
We refer to the assumption of a parametric reservoir with the equilibrium state expressed in Eq. 36 as the "naive assumption" because of several issues that were sidestepped above. The first issue stems from the assumption that all components of the parameter vector θ are subject to the same temperature, i.e., $\langle \eta_i(t)\eta_i(t) \rangle = 2 k_B T / \mu$ for every index i. In practice, we might find different values of the noise width, particularly across different layers of a deep neural network. Furthermore, the weights or biases within a specific layer might experience different amounts of fluctuation. This scenario is entirely acceptable if we consider each group of parameters as a subsystem that contributes to the formation of the parametric reservoir Θ. Consequently, each subsystem possesses a different environmental temperature and a distinct stationary state. This observation may explain, in thermodynamic terms, why a deep neural network can offer a richer model: as it encompasses multiple heat reservoirs at varying temperatures, it presents a perfect paradigm for the emergence of non-equilibrium thermodynamic properties.
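A per-group temperature estimate follows directly from the group-wise noise variance. The following sketch (with synthetic stand-in fluctuations for two hypothetical layers, not measured from a real network) assigns each parameter group its own Boltzmann factor:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical recordings of eta(t) for two parameter groups ("layers"),
# with deliberately different noise widths to mimic layer-dependent
# temperatures; shapes are (time steps, group dimension).
n_steps = 300
dims = {"layer1": 20, "layer2": 5}
widths = [0.1, 0.5]   # stand-in noise scales per group

etas = {name: rng.normal(scale=s, size=(n_steps, d))
        for (name, d), s in zip(dims.items(), widths)}

mu = 0.05   # mobility constant
# Per-group Boltzmann factor k_B = mu * <eta_i eta_i> / 2 (T = 1):
temperatures = {name: mu * e.var(axis=0).mean() / 2
                for name, e in etas.items()}

# The noisier group plays the role of a hotter sub-reservoir:
assert temperatures["layer2"] > temperatures["layer1"]
```

Each group is thus treated as coupled to its own effective heat bath, in line with the multi-reservoir picture described above.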
Second, the fluctuation term η may exhibit an autocorrelation property characteristic of colored noise, as presented in Ref. [37]. While this introduces additional richness to the problem, potentially displaying non-Markovian properties, it does not prevent us from deriving the equilibrium state of the subsystem Θ, as demonstrated in [38].
We also overlooked irregular behaviors of the loss function, such as spikes or step-like patterns. These are considered anomalous since we typically expect the loss function to decline monotonically, yet in practice such behaviors are quite common. They may be associated with a more intricate process, such as a phase transition or a shock, experienced by the reservoir. Nevertheless, we can still uphold the basic parametric-reservoir assumption during the time intervals between these irregular behaviors.
The issues mentioned above reflect a richer and more complex dynamics of the subsystem Θ, and do not fundamentally contradict its potential role as a reservoir. Examples of these richer dynamics can be found in a recent study [39], which shows the limitations of the Langevin formulation of SGD, and in Ref. [40], which investigates exotic non-equilibrium characteristics of the parameters' dynamics under SGD optimization.
Before closing this section, it is worth mentioning that the experimental results presented in Figure 4.1 support the assumption of a low-variance condition for the stochastic dynamics of the subsystem Θ. For instance, panel (a) shows that even in the high-noise regime (|b_t| = 1), the dynamics of the parameters remain confined to a small region across the ensemble. Furthermore, panel (b) demonstrates the low-variance characteristics of the model's accuracy. Finally, the large magnitude of the deterministic force (dashed line in panel (c)) relative to the random force is evidence of low-variance dynamics.

Figure 3.2: This figure shows the Bayesian network for the joint trajectory probability P[x_n, θ_n], based on a dual-timescale bipartite dynamics.

Figure 4.1: This experiment contrasts the parameter dynamics under three different mini-batch sizes: |b_t| = 1, |b_t| = 10, and |b_t| = 100. The model under consideration is a four-layer feedforward neural network with a uniform width of 200 neurons, trained on the MNIST classification task using a vanilla SGD optimizer. The experiment was replicated over 50 trials to generate an ensemble of parameters. a) One random parameter from the model's last layer is chosen for each batch-size scenario, and four of its dynamic realizations are depicted. b) Illustrates both the average accuracy (solid line) and the variance of accuracy within the ensemble (shaded area), emphasizing the low-variance condition, which asserts that macroscopic quantities such as accuracy have low-variance statistics across the ensemble. c) Displays the noise variance averaged over all parameters, i.e., $\frac{1}{\dim(\theta)} \sum_{i} C_{i,i}(t, 0)$, for each mini-batch size scenario, underscoring the stationary nature of η; this panel also highlights the role of mini-batch size in determining the noise width, i.e., the temperature of the environment. The horizontal dashed line indicates the maximum absolute value observed from $\nabla_\theta U_B(\theta_{t_n})$, serving as a reference point for the magnitude of the noise. d) Exhibits the autocorrelation of η averaged over all parameters; for instance, computing this quantity at step 1000 reads $C_{i,i}(t = 1000, t' - t)$. The rapid decline in autocorrelation with time lag indicates the white-noise characteristic of η.

Table 2.1: A list of notations used in this paper.

In the context of the learning problem, the DPI referenced in Eq. 4 suggests that what is memorized is always greater than or equal to what is learned. The L-info metric is task-oriented. For example, in the realm of image generation, it quantifies the statistical resemblance between the model's outputs and the genuine images. In the case of a classification task, L-info encapsulates only the information pertinent to label prediction. In contrast, M-info can encompass information not directly pertinent to the current task. For instance, it might capture intricate pixel configurations in an image dataset that are not crucial for identifying distinct patterns such as human faces. The DPI in Eq. 4 neatly illustrates the risk of overfitting, when a model starts to incorporate extraneous information that does not align with the learning objective.
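The data-processing inequality underlying this statement can be illustrated with a toy discrete Markov chain X → Y → Z, where passing through a second channel can only lose information about X. All distributions below are hypothetical stand-ins, not the paper's M-info or L-info:

```python
import numpy as np

def mutual_info(p_xy):
    """Mutual information I(X;Y) in nats from a joint distribution table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (px @ py)[mask])).sum())

p_x = np.array([0.5, 0.5])                   # source distribution
ch_yx = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(y|x): a noisy channel
ch_zy = np.array([[0.7, 0.3], [0.3, 0.7]])   # p(z|y): further processing

p_xy = p_x[:, None] * ch_yx   # joint p(x, y)
p_xz = p_xy @ ch_zy           # joint p(x, z), marginalizing over y

# DPI for the chain X -> Y -> Z: downstream processing loses information.
assert mutual_info(p_xz) <= mutual_info(p_xy)
```

In the same spirit, L-info (information relevant to the task output) can never exceed M-info (information stored in the parameters), since the task-relevant quantity is obtained downstream of the memorized one.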