Article

Control of Overfitting with Physics

by Sergei V. Kozyrev 1,*, Ilya A. Lopatin 1 and Alexander N. Pechen 1,2

1 Steklov Mathematical Institute of Russian Academy of Sciences, Gubkina St. 8, Moscow 119991, Russia
2 Ivannikov Institute for System Programming of the Russian Academy of Sciences, Alexandra Solzhenitsyna Str. 25, Moscow 109004, Russia
* Author to whom correspondence should be addressed.
Entropy 2024, 26(12), 1090; https://doi.org/10.3390/e26121090
Submission received: 15 October 2024 / Revised: 7 December 2024 / Accepted: 10 December 2024 / Published: 13 December 2024
(This article belongs to the Special Issue The Statistical Physics of Generative Diffusion Models)

Abstract

While there are many works on the applications of machine learning, not so many of them attempt to understand the theoretical justifications explaining their efficiency. In this work, overfitting control (or the generalization property) in machine learning is explained using analogies from physics and biology. For stochastic gradient Langevin dynamics, we show that the Eyring formula of kinetic theory allows one to control overfitting in the algorithmic stability approach: wide minima of the risk function with low free energy correspond to low overfitting. For the generative adversarial network (GAN) model, we establish an analogy between GAN and the predator–prey model in biology. An application of this analogy allows us to explain the selection of wide likelihood maxima and an overfitting reduction for GANs.

1. Introduction

Analogies from physics and other fields, particularly population genetics, are of interest when studying problems in machine learning theory. Analogies between machine learning theory and Darwinian evolution theory were discussed already by Alan Turing [1]. Biological analogies in computing were discussed by John von Neumann [2]. Physical models in relation to computing were discussed by Yuri Manin [3]. Such analogies allow physical intuition to be used in learning theory. Among the well-known examples are genetic [4] and evolutionary algorithms [5], models of neural networks and physical systems with emergent collective computational abilities and content-addressable memory [6], and a parallel search learning method based on statistical mechanics and Boltzmann machines that mimic Ising spin chains [7]. A phenomenological model of population genetics, the Lotka–Volterra model with mutations, related to the generative adversarial network (GAN), was introduced in [8]. Analogies between the evolution operator in physics and transformers (an artificial intelligence model) were discussed in [9]. Ideas of thermodynamics in application to learning were considered in [10,11], and in relation to evolution theory in [12,13]. GANs have found many applications in physics, e.g., for the inverse design of electromagnetically induced transparency metasurfaces [14]. Physics-informed neural networks were developed for solving optimal quantum control problems for open quantum systems [15]. Beyond that, the theory of open systems inspired the development of quantum reinforcement learning, where the states of the agent are quantum states of some quantum system and the dynamic environment is a quantum environment surrounding the manipulated system, with the aim of optimizing the reward [16,17]; this is connected to the general incoherent quantum control paradigm [18]. Various approaches to quantum machine learning were also proposed, e.g., [19]. Methods of open quantum systems were suggested for application to spin glasses [20], network science [21], and finance [22]. Quantum neural networks were proposed for certain problems, e.g., to work with encrypted data to protect the privacy of users' data and models [23].
Within the framework of such analogies, it is natural to discuss biological analogs of various models of machine learning theory. In this work, two analogies are considered: between stochastic gradient Langevin dynamics (SGLD) in machine learning and the Eyring formula in kinetic theory [24], and between the GAN model [25] and the mathematical predator–prey model in biology, where we suggest considering the discriminator and generator in the GAN as playing the roles of prey and predator, respectively. The proposed analogies allow us to explain the efficiency of controlling overfitting, which is the lack of generalization ability of a machine learning approach. It is known that overfitting is reduced for stochastic gradient descent (SGD); for the GAN, this effect of reducing overfitting is even more significant. We propose to explain overfitting control in these processes within the framework of the algorithmic stability approach by the suppression of narrow minima of the empirical risk function.
In Section 2, we consider stochastic gradient Langevin dynamics (SGLD) using a stochastic differential equation (SDE). We show that the reduction of overfitting in such a model follows from ideas used in chemical kinetics, such as the Eyring formula, which states that the reaction rate (the rate of the transition between two potential wells due to diffusion) is determined by the free energy of the transition state (the saddle between the two potential wells) and the free energy of the initial state of the reaction (optimization of quantities involving the entropy-dependent Helmholtz free energy also appears in quantum optimization, e.g., [26]).
In Section 3, we describe the minimax problem for the GAN model by a system of two stochastic differential equations (one is for the discriminator and another is for the generator). In this sense, the GAN model is a two-body generalization of the SGLD model considered in Section 2. However, this generalization significantly changes the behavior of the learning system, as demonstrated by the simulations below. We show that this model implements a selection of wide maxima of the likelihood function, leading to a reduction of overfitting. A biological interpretation of the GAN model is provided in terms of the interaction between a predator and a prey (the predator is the generator, the prey is the discriminator). Learning for GANs by solving a system of ordinary differential equations was considered in [27,28,29].
In Section 4, we introduce a generalization of the GAN to a type of population genetics model. We consider a branching random process with diffusion for two types of particles (discriminators and generators), in which discriminators and generators can replicate. In this case, the rates of replication and death of the particles depend on the contributions to the minimax functional defined by the GAN model.
In Section 5, we provide numerical simulations illustrating the behavior of stochastic gradient descent and the predator–prey model of the minimax problem for the GAN, for a simple potential with two wells. We observe the following two regimes: pushing out of the narrower well and oscillations in the wider well. One parameter of the interaction allows one to control the transition between these two regimes. A third regime, in which escape from both wells occurs, is possible; however, this regime can be avoided by adjusting the parameters.
Some relevant notions from the theory of random processes are provided in Appendix A. Section 6 summarizes the results.

2. Stochastic Gradient Descent and the Eyring Formula

2.1. Stochastic Gradient Descent

Let $\{z_l\}$, $l = 1, \ldots, L$, be a training sample (with elements belonging to $\mathbb{R}^n$), let $f_l(x) = \mathcal{L}(z_l, x)$ be the loss function for the $l$-th sample object, and let $x$ be a hypothesis (we assume that the hypothesis space is $\mathbb{R}^m$). Minimization of the empirical risk is the following problem:
$$f(x) = \frac{1}{L} \sum_{l=1}^{L} f_l(x) \to \min_x. \qquad (1)$$
The gradient descent algorithm for this problem, when the loss function belongs to the space of continuously differentiable functions $C^1(X)$, is defined by the solution of the differential equation $\frac{dx}{dt} = -\nabla f(x)$, which is numerically described by the iterative process starting from some initial guess $x_0$, such that
$$x_{k+1} = x_k - \alpha_k \nabla f(x_k) = x_k - \alpha_k \frac{1}{L} \sum_{l=1}^{L} \nabla f_l(x_k), \qquad (2)$$
where $\alpha_k$ is the gradient step (also often called the learning rate (lr)) at the $k$-th iteration. There are different methods for introducing stochasticity into training for the problem (1), such as mini-batch learning or dropout. In this paper, we consider a method based on adding Gaussian noise to (2), which is related to stochastic gradient Langevin dynamics.
Stochastic gradient Langevin dynamics is the modification of the above procedure in which small independent random perturbations are added at each step of gradient descent, such that
$$x_{k+1} = x_k + w_k - \alpha_k \nabla f(x_k), \qquad (3)$$
where the set of all $w_k$ is a set of independent Gaussian random vectors.
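For concreteness, here is a minimal NumPy sketch of the update (3) (our illustration; the toy loss, step size, and noise scale are assumptions, not the authors' setup):

```python
import numpy as np

def sgld(grad_f, x0, steps=5000, alpha=1e-2, theta=0.1, seed=0):
    """Stochastic gradient Langevin dynamics, Eq. (3):
    x_{k+1} = x_k - alpha * grad f(x_k) + w_k, with Gaussian noise w_k
    of variance 2*theta*alpha (an Euler-Maruyama step of the SDE (4))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        w = rng.normal(0.0, np.sqrt(2.0 * theta * alpha), size=x.shape)
        x = x - alpha * grad_f(x) + w
    return x

# Toy potential f(x) = ||x||^2 / 2, so grad f(x) = x; for long runs the iterates
# sample approximately from the Gibbs distribution ~ exp(-f(x)/theta).
x_final = sgld(lambda x: x, x0=[1.0, 1.0])
```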
This procedure can be considered as a discrete-time version of the stochastic equation
$$d\xi^i(t) = \sqrt{2\theta}\, dw^i(t) - \frac{\partial f(\xi(t))}{\partial x^i}\, dt, \qquad (4)$$
where $dw^i(t)$ is the stochastic differential of the Wiener process (the factor 2 under the square root is used for convenience, to have a unit coefficient instead of $1/2$ in front of $\theta \Delta$ in Equation (5) below).
The stochastic gradient Langevin dynamics procedure with an equation of the form (4) was discussed in [30,31,32,33].
The diffusion equation in a potential is the partial differential equation
$$\frac{\partial u}{\partial t} = \theta \Delta u + \nabla u \cdot \nabla f + u \Delta f, \qquad (5)$$
where $x \in \mathbb{R}^d$, $u = u(x, t)$ is the distribution function, $f = f(x)$ is the potential, $f \in C^2(\mathbb{R}^d)$, and $\theta > 0$ is the temperature.
Equivalently, this diffusion equation can be rewritten as (where $\beta = 1/\theta$ is the inverse temperature)
$$\frac{\partial u}{\partial t} = \theta\, \mathrm{div}\left( e^{-\beta f}\, \mathrm{grad}\left( u\, e^{\beta f} \right) \right).$$
The Gibbs distribution $e^{-\beta f}$ is a stationary solution of this equation. The solution converges to the Gibbs distribution under certain conditions on $f$, as discussed in [34].
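Indeed, substituting $u = e^{-\beta f}$ into the divergence form above gives a one-line verification: since $u\, e^{\beta f} = 1$,
$$\frac{\partial u}{\partial t} = \theta\, \mathrm{div}\left( e^{-\beta f}\, \mathrm{grad}\, 1 \right) = 0.$$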
Diffusion Equation (5) is the Fokker–Planck equation (see (A2) in Appendix A) for stochastic gradient Langevin dynamics (4).
We would like to highlight the following: the well-known Stable Diffusion neural network and, more generally, latent diffusion models [35,36] do use diffusion described by a stochastic differential equation (as in SGLD). However, the general formulation of the problem there is different: a special type of diffusion is used for generating objects (particularly images), while we consider diffusion (SGLD) for learning in a potential.

2.2. Overfitting Control for Stochastic Gradient Descent

Overfitting is the lack of generalization ability of the solution of a learning problem (i.e., high likelihood on the training sample and low likelihood on the validation sample). One approach to overfitting control is based on algorithmic stability, i.e., on the stability of the solution obtained by a learning algorithm with respect to perturbations of the training sample [37,38,39]. In this case, narrow (sharp) minima of the empirical risk functional (in the hypothesis space) are associated with overfitting, and wide (flat) minima correspond to solutions of the learning algorithm without overfitting [40].
Introducing noise into the gradient descent procedure, i.e., considering the SDE (4) and the diffusion (5), is related to the problem of overfitting as follows. The Eyring formula, a generalization of the Arrhenius formula, describes the reaction rate in kinetic theory (the rate of transition between two potential wells due to diffusion of the form (5)): the reaction rate is proportional to
$$e^{-\beta (F_1 - F_0)}, \qquad (6)$$
where $F_1$ is the free energy of the transition state (the saddle between the two potential wells) and $F_0$ is the free energy of the initial state of the reaction (the potential well from which the transition occurs). The free energy of a state is $F = E - \beta^{-1} S$, where $E$ is the energy and $S$ is the entropy of the state. In general, the free energy of a set $U$ (say, a potential well) is defined by
$$e^{-\beta F(U)} = \int_U e^{-\beta E(x)}\, dx.$$
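To make the entropic contribution explicit, consider (as our illustrative computation, not from the original) a well $U$ around a minimum $c$ with an approximately quadratic energy profile $E(x) \approx E_0 + \|x - c\|^2 / (2\sigma^2)$ in $\mathbb{R}^d$. The Gaussian integral then gives
$$e^{-\beta F(U)} \approx e^{-\beta E_0} \int e^{-\beta \|x - c\|^2 / (2\sigma^2)}\, dx = e^{-\beta E_0} \left( \frac{2\pi \sigma^2}{\beta} \right)^{d/2}, \qquad F(U) \approx E_0 - \frac{d}{2\beta} \log \frac{2\pi \sigma^2}{\beta},$$
so, at equal depth $E_0$, a wider well (larger $\sigma$) has lower free energy, and by (6) the escape from it is slower.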
The connection between the Arrhenius and Eyring formulas and the spectral asymptotics for the Schrödinger operator corresponding to the subbarrier (tunnel) transition between two potential wells was discussed in [34].
Let us emphasize that we do not assume any minimization procedure: stochastic gradient Langevin dynamics (4) generates the Gibbs distribution according to the diffusion Equation (5), and this Gibbs distribution will be concentrated in potential wells with low free energy. In view of the above discussion, learning with stochastic gradient Langevin dynamics is a search for a potential well that gives the global minimum of free energy. The influence of the temperature is important when comparing the entropy and energy parts of the free energy. For the effect of particle capture by a well (i.e., learning) to take place, it is important that the temperature be significantly (e.g., by several times) less than the difference between the free energies of the well and the saddle; hence, the temperature is important for the SGLD. Thus, for successful learning, the temperature should be low enough to make the capture of the particle by a potential well possible.
The Eyring formula (6) implies that, for equal free energies of the transition state and equal energies of the initial state, the transition rate will be lower (the free energy of the initial state will be lower) for potential wells with higher entropy (wider ones). Thus, stochastic gradient Langevin dynamics (4) will correspond to a regime in which wider wells (with higher entropy) capture the learning system more effectively, i.e., algorithmically stable solutions to the learning problem will be selected during stochastic gradient optimization (here, we consider the SGLD procedure, but we believe the same effect will hold for other forms of SGD, such as the mini-batch procedure).
The validity of the proposed approach is limited by the assumptions for the Eyring formula. The Eyring formula describes transitions for a diffusion equation in the potential and might be applied to various complicated high-dimensional landscapes having clear minima and saddles between them. Its validity under quite general conditions has been justified in [41]. For landscapes which might not exhibit clear minima and saddles between them, the proposed approach based on the Eyring formula may not work.
Physical analogies in machine learning, particularly the application of free energy, were discussed in [10]. In [11], the following procedure was considered: the empirical risk functional (depending on the hypothesis $x$) was replaced by the so-called "local entropy" functional, which looks like minus the free energy of some vicinity of $x$ (where the energy is the empirical risk). In this way, wider empirical risk minima correspond to deeper minima of the new "local entropy" functional. The relation to the generalization property in the approach of [40] (flat minima) was discussed there (although the authors of [11] do not discuss the Eyring formula).

3. The GAN Model and Overfitting

3.1. Stochastic Gradient Langevin Dynamics for GAN

The generative adversarial network (GAN) model is the following minimax problem [25]:
$$\min_y \max_x V(x, y); \qquad (7)$$
$$V(x, y) = \frac{1}{L} \sum_{l=1}^{L} \log D(z_l, x) + \int_Z p_{\mathrm{gen}}(z, y) \log\left( 1 - D(z, x) \right) dz; \qquad (8)$$
$\{z_l\}$ is a sample, where $z_l \in Z$; $D(z, x)$ and $p_{\mathrm{gen}}(z, y)$ are parametric families of probability distributions on the space $Z$, called the discriminator and the generator, with parameters $x$ and $y$ from statistical manifolds $X$ and $Y$ (which we will assume to be real vector spaces).
In [25], the generator was considered as a parametric family of mappings from some auxiliary space to $Z$. These mappings transferred the probability distribution on the auxiliary space to $Z$. The discriminator was described by a distribution on the same space as the data, and the interpretation was as follows: given the data and the generator, the discriminator outputs a binary variable, "one" or "zero", according to this distribution (i.e., it outputs "one" if the discriminator considers these inputs to be correct).
The first contribution to $V(x, y)$ in (8) is the log-likelihood function. The second contribution behaves qualitatively as minus the inverse of the Kullback–Leibler distance (Kullback–Leibler divergence, KL distance, or KLD) between the distributions of the discriminator and the generator (that is, this contribution is negative, large in magnitude for small KL distances, and grows to zero for large KL distances), as was mentioned in [25]. Thus,
$$V(x, y) = V_1(x) + V_2(\rho(x, y)).$$
Here,
$$\rho(x, y) = \mathrm{KL}\left( D(x) \,|\, p_{\mathrm{gen}}(y) \right),$$
where the Kullback–Leibler distance between probability distributions $p$ and $q$ is defined as
$$\mathrm{KL}(p | q) = \int_Z p(z) \log \frac{p(z)}{q(z)}\, dz.$$
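For intuition (a standard closed form, added here for illustration), for two $d$-dimensional Gaussians with means $\mu_1$, $\mu_2$ and a common covariance $\sigma^2 I$,
$$\mathrm{KL}\left( \mathcal{N}(\mu_1, \sigma^2 I) \,|\, \mathcal{N}(\mu_2, \sigma^2 I) \right) = \frac{\|\mu_1 - \mu_2\|^2}{2\sigma^2},$$
so the second contribution $V_2 \sim -\mathrm{KL}^{-1}$ grows in magnitude as the generator's distribution approaches the discriminator's.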
The minimax of $V(x, y)$ over $x$, $y$ is obtained from the local maximum of $V_1(x)$ over $x$. Transitions between the local maxima of $V_1(x)$ generate transitions between local minimaxes of $V(x, y)$ (the generator follows the discriminator), as shown below.
The stochastic gradient Langevin dynamics optimization for the problem (7), (8) can be described by a system of SDEs defining random walks $\xi = (\xi^i(t))$, $i = 1, \ldots, m$, and $\eta = (\eta^j(t))$, $j = 1, \ldots, n$, on the statistical manifolds $X$ and $Y$ of the discriminator and generator:
$$d\xi^i(t) = \sqrt{2\theta}\, dw^i(t) + dt\, \frac{\partial}{\partial x^i} V(\xi(t), \eta(t)), \qquad (9)$$
$$d\eta^j(t) = \sqrt{2\theta}\, dv^j(t) - dt\, \frac{\partial}{\partial y^j} V(\xi(t), \eta(t)). \qquad (10)$$
Here, $\frac{\partial}{\partial x^i} V$ and $\frac{\partial}{\partial y^j} V$ are the derivatives of $V$ with respect to the first and second arguments, respectively; $w^i(t)$ and $v^j(t)$ are Wiener processes on the parameter spaces of the discriminator and generator. This system of stochastic equations follows directly from the minimax condition and is independent of the KL distance formulation. However, the KLD formulation is used later for analyzing the behavior of the solution. In this system of SDEs, the discriminator seeks to maximize the function $V(x, y)$ (8) with respect to $x$, and the generator seeks to minimize this function with respect to $y$.
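A minimal sketch of an Euler–Maruyama discretization of the system (9) and (10) (our illustration; the gradients, step size, and temperature are assumed inputs, not the authors' implementation):

```python
import numpy as np

def gan_sgld(grad_x_V, grad_y_V, x0, y0, steps=10000, alpha=1e-3, theta=0.05, seed=0):
    """Coupled Langevin updates: the discriminator x ascends V (plus noise),
    the generator y descends V (plus noise), cf. Equations (9)-(10)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    scale = np.sqrt(2.0 * theta * alpha)
    for _ in range(steps):
        x = x + alpha * grad_x_V(x, y) + rng.normal(0.0, scale, x.shape)
        y = y - alpha * grad_y_V(x, y) + rng.normal(0.0, scale, y.shape)
    return x, y

# Example 1 below (V = omega * x * y) in this notation, with theta = 0:
omega = 1.0
x, y = gan_sgld(lambda x, y: omega * y, lambda x, y: omega * x,
                x0=[1.0], y0=[0.0], theta=0.0)
```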
Example 1.
Let us consider one-dimensional parameters $x$ and $y$ for the discriminator and the generator, respectively, and the functional $V = \omega x y$ with minimax located at the origin. The noiseless GAN equation system is
$$\frac{dx}{dt} = \frac{\partial}{\partial x} V(x, y) = \omega y, \qquad \frac{dy}{dt} = -\frac{\partial}{\partial y} V(x, y) = -\omega x.$$
Its solution is
$$x = A \sin \omega (t - t_0), \qquad y = A \cos \omega (t - t_0),$$
with oscillations around the minimax.
In [27,28,29], the convergence of the optimization of the GAN model by the gradient descent method with respect to the parameters of the discriminator and generator in the neighborhood of the functional’s local minimax was studied, and oscillations of the parameters were discussed.

3.2. Overfitting Control for GAN

If we ignore the presence of the generator, then the dynamics of the discriminator (9) for optimization with noise will correspond to diffusion in the potential generated by the data. Thus, the arguments of Section 2 will be applicable. Therefore, overfitting can be reduced according to the Eyring formula.
The presence of the generator will further suppress overfitting. The minimax problem for the GAN (7) can be described as follows. The discriminator (9) tries to reach regions of the parameter $x$ with high values of $V(x, y)$. The generator (10) tries to reach regions of the parameter $y$ with low values of $V(x, y)$. In this case, the contribution to $V(x, y)$ (8) from the likelihood function depends only on the parameters of the discriminator, i.e., the discriminator tries to increase both contributions to (8), and the generator tries to decrease only the second contribution. The second contribution to (8) decreases at small Kullback–Leibler distances between the discriminator and the generator.
Therefore, the compromise between the optimization problems for the discriminator and the generator will be achieved when they are located at maxima of the likelihood contribution to (8) which are sufficiently wide in the space of the parameters $(x, y)$ (where the average KL distance between the discriminator and the generator is not too small). Selecting wide maxima, in accordance with the algorithmic stability approach, will reduce the effect of overfitting.
Here, we propose a biological prey–predator interpretation of the GAN model, which is completely different from the interpretation used in [25]. In our interpretation, the discriminator is a herbivore (prey), the generator is a predator, and the data are grass. Then, the minimax problem (7) describes the situation where the discriminator searches for grass (the maximum of the likelihood function; this corresponds to an increase in the first contribution in (8)) and also runs away from the predator (this corresponds to an increase in the second contribution to (8)), while the predator chases the prey (and hence decreases this contribution). In our interpretation, the discriminator (herbivore), as a distribution, tries to get closer to the data distribution (grass) and farther from the generator (predator) as a distribution (in the sense of the KL distance), and the generator tries to be closer to the discriminator. As a result of this interaction, the generator also moves toward the data (grass), because herbivores (the discriminator) are likely to be found there; however, this does not mean that the predator tries to imitate grass (that would be a mixture of the two interpretations of the GAN). Minimization in (7) for the generator forces the predator to move to fields (or meadows, likelihood maxima) where the discriminator is present. The interplay of the two contributions to (8) forces the discriminator to search for sufficiently wide meadows (likelihood maxima) where the average KL distance from the predator is not too small. In general, the predator pushes the prey out of narrow grass fields, and both the prey and the predator move to wide grass fields. Thus, the GAN model implements the selection of wide likelihood maxima, which reduces overfitting.
Simulations illustrating the behavior discussed above are presented in Section 5 below.

4. Branching Random Process for GAN

In this section, a branching random process with diffusion and particle interactions describing the populations of discriminators and generators in a generalization of the GAN model is introduced.
The theory of branching random processes and its connection with population genetics have been actively discussed in the literature, for example, in [42,43]. Previously, in [8], a generalization of the GAN model related to population genetics (a Lotka–Volterra-type model with mutations) was discussed. In this model, discriminators and generators could reproduce and form populations. The phenomenological equations of population dynamics were considered, and the suppression of overfitting was discussed.
Consider a generalization of the GAN to the case of several discriminators (particles in the hypothesis space of the discriminator with parameter x, particles are indexed by a) and generators (particles in the hypothesis space of the generator with parameter y, particles are indexed by b). The analog of the SDE system (9) and (10) will take the form
$$d\xi^i_{(a)}(t) = \sqrt{2\theta}\, dw^i_{(a)}(t) + dt\, \frac{\partial}{\partial x^i} V\left( \xi_{(a)}(t), \overline{\eta(t)} \right); \qquad (11)$$
$$d\eta^j_{(b)}(t) = \sqrt{2\theta}\, dv^j_{(b)}(t) - dt\, \frac{\partial}{\partial y^j} W\left( \overline{\xi(t)}, \eta_{(b)}(t) \right); \qquad (12)$$
where each particle is associated with its own independent Wiener process $w^i_{(a)}(t)$, $v^j_{(b)}(t)$ on the right-hand side of the equation, in the discriminator and generator spaces, respectively, and the interaction terms on the right-hand sides of the equations have the form
$$V(x, \bar{y}) = \frac{1}{L} \sum_{l=1}^{L} \log D(z_l, x) + \sum_b \int_Z p_{\mathrm{gen}}(z, y_b) \log\left( 1 - D(z, x) \right) dz = V_1(x, \{z\}) + V_2(x, \bar{y}); \qquad (13)$$
$$W(\bar{x}, y) = \sum_a \int_Z p_{\mathrm{gen}}(z, y) \log\left( 1 - D(z, x_a) \right) dz. \qquad (14)$$
Here, $V_1(x, \{z\})$ is the likelihood function for the discriminator $x$.
This corresponds to a GAN-type model with the functional $V(x, \bar{y})$ for the discriminator $x$ and the functional $W(\bar{x}, y)$ for the generator $y$. Equations (11) and (12) describe optimization by stochastic gradient Langevin dynamics. Each discriminator interacts with the set of generators, and similarly, each generator interacts with the set of discriminators. Here, $\bar{x} = \{x_1, \ldots, x_M\}$ and $\bar{y} = \{y_1, \ldots, y_N\}$ are the sets of discriminators and generators, respectively. The second contribution $V_2(x, \bar{y})$ and the function $W(\bar{x}, y)$ contain sums of contributions that behave qualitatively as $-\mathrm{KL}(x, y)^{-1}$, where $\mathrm{KL}(x, y)$ is the Kullback–Leibler distance between the discriminator and generator with parameters $x$ and $y$.
Let us define a model mimicking population genetics: a branching random process with diffusion and interaction for particles of two types, $\xi^i_{(a)}(t)$ and $\eta^j_{(b)}(t)$ (discriminators and generators), which perform random walks in accordance with Equations (11) and (12) and have the ability to replicate and die, with the probabilities of these processes depending on the functionals (13) and (14). The replication of a particle consists of replacing it with two particles of the same type with the same coordinates (which can then perform random walks in accordance with (11) and (12)).
We propose to use the following branching rates (related to the Lotka–Volterra-type model discussed in [8]): the death rate of the generators is considered fixed, while the replication rate of the generator $\eta_{(b)}(t)$ is proportional to (recall that both $W$ and $V_2$ are negative)
$$-W\left( \overline{\xi(t)}, \eta_{(b)}(t) \right);$$
the replication rate of the discriminator $\xi_{(a)}(t)$ is proportional to
$$\exp\left( V_1(\xi_{(a)}(t), \{z\}) \right);$$
and the death rate of the discriminator $\xi_{(a)}(t)$ is proportional to
$$-V_2\left( \xi_{(a)}(t), \bar{y} \right).$$
Thus, discriminators replicate depending on the data and die depending on the generators. Generators replicate depending on the discriminators and die at a constant rate.
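A schematic sketch of one step of this branching process (our illustration; the rate constants and the functionals `V1`, `V2`, `W` are assumed to be supplied, with $V_2 \le 0$ and $W \le 0$ as above):

```python
import numpy as np

def branching_step(discs, gens, V1, V2, W, dt=1e-2, gen_death_rate=1.0, rng=None):
    """One step of the branching process: discriminators replicate with rate
    ~ exp(V1(x)) and die with rate ~ -V2(x, gens); generators replicate with
    rate ~ -W(discs, y) and die at a constant rate. Probabilities are rate * dt."""
    rng = np.random.default_rng() if rng is None else rng
    new_discs = []
    for x in discs:
        if rng.random() < np.exp(V1(x)) * dt:      # replication driven by the data
            new_discs.append(x.copy())
        if rng.random() >= -V2(x, gens) * dt:      # survives death by generators
            new_discs.append(x)
    new_gens = []
    for y in gens:
        if rng.random() < -W(discs, y) * dt:       # replication near discriminators
            new_gens.append(y.copy())
        if rng.random() >= gen_death_rate * dt:    # survives constant-rate death
            new_gens.append(y)
    return new_discs, new_gens
```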
The biological interpretation of the proposed model is the following. The data { z l } are the distribution of grass, the discriminators ξ ( a ) i ( t ) are herbivores (prey), the generators η ( b ) j ( t ) are predators; herbivores reproduce on grass, and predators hunt herbivores. The effect of suppressing overfitting on narrow likelihood maxima looks as follows: if the discriminator has replicated on the likelihood maximum (on its statistical manifold X), the generator will tend to go there and replicate there (the generator will tend to the corresponding regions of its statistical manifold Y, such that the KL–distance between D ( · , x ) and p gen ( · , y ) is small). In this case, for a narrow likelihood maximum, the average KL–distance will be small, i.e., the predator will eat the prey more effectively (and then suffer from hunger) than for a wide maximum. This is how the effect of selective suppression of narrow population maxima in X and Y corresponding to narrow likelihood maxima is realized. For the case of the population genetics model for the GAN (where discriminators and generators can replicate), the effect of overfitting control is more pronounced than for the standard GAN model (without replication).

5. Simulations

In this section, the results of the numerical simulation of the SGLD procedure and the simulation of the predator–prey model for the GAN are provided.

5.1. Objective Function

Let $L(x)$ be an objective function for optimization, $L(x) \to \max$. We consider $L(x)$ as a sum of non-normalized Gaussians of the following form:
$$L(x) = \sum_{j=1}^{n} q_j\, e^{-\frac{\|x - c_j\|^2}{2\sigma_j^2}}, \qquad (15)$$
where $q_j$ and $\sigma_j$ are real-valued positive constants, and $x, c_j \in \mathbb{R}^d$. For visualization, we use $d = 2$. The constant $\sigma_j$ can be interpreted as the characteristic width of the extremum. We consider the case when $\sigma_k \ll \|c_i - c_j\|$ for all $k$ and all $i \neq j$, so that $L(x)$ has $n$ separated extrema.
For $n = 2$, this objective function has two hills around the two extrema (or wells for the minimization of $-L(x)$; for definiteness, we refer to them below as wells).

5.2. Stochastic Gradient Langevin Dynamics

For the objective function (15), we consider $n = 2$, $\sigma_1 = 3.0$, $\sigma_2 = 1.5$, $c_1 = (5.5, 5.5)^T$, $c_2 = (3.0, 3.0)^T$, and $q_j = \sigma_j^2$ for $j = 1, 2$ (here and in similar places below, $T$ means the transpose of the vector, not the temperature). A thermal plot of the function $L$ and its gradient field are shown in Figure 1. As the starting point, we set $(0.0, 0.0)^T$. Note that $\nabla_x L(0, 0)$ has positive coordinates. Therefore, the standard gradient descent procedure starting from the point $x_0 = (0.0, 0.0)^T$ will converge to $c_2$, which is the well with the smaller width. Consider the standard stochastic gradient procedure
$$x_{k+1} = x_k + \alpha \nabla_x L(x_k) + \xi_k, \qquad x_0 = (0, 0)^T,$$
where $\xi_k$ are two-dimensional independent random variables,
$$\xi_k \sim \left( N\left( 0, T (1 + k)^{-1/2} \right),\; N\left( 0, T (1 + k)^{-1/2} \right) \right),$$
where $N$ is the normal distribution with center at the origin and variance $T (1 + k)^{-1/2}$, $T$ is the temperature parameter, and $\alpha$ is a fixed learning rate. The scaling $(1 + k)^{-1/2}$ is introduced to ensure convergence of the SGLD process.
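A sketch of this experiment in code (our reconstruction of the setup above; the gradient of (15) is computed analytically, and the learning rate is an assumed value):

```python
import numpy as np

# Parameters of Section 5.2: two Gaussian wells of the objective (15)
centers = [np.array([5.5, 5.5]), np.array([3.0, 3.0])]
sigmas = [3.0, 1.5]
qs = [s**2 for s in sigmas]

def grad_L(x):
    """Analytic gradient of L(x) = sum_j q_j * exp(-||x - c_j||^2 / (2 sigma_j^2))."""
    g = np.zeros_like(x)
    for q, s, c in zip(qs, sigmas, centers):
        g += q * np.exp(-np.sum((x - c)**2) / (2 * s**2)) * (c - x) / s**2
    return g

def run(T, alpha=0.05, K=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(2)
    for k in range(K):
        # noise variance T * (1 + k)^(-1/2), i.e., std = sqrt(T) * (1 + k)^(-1/4)
        x = x + alpha * grad_L(x) + rng.normal(0.0, np.sqrt(T) * (1 + k)**(-0.25), 2)
    return x
```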
We consider 15 values of the temperature $T$ uniformly spaced on $[0, 0.8]$. For each value of $T$, we run the SGLD starting from the same initial point $x_0$ with a maximal number of iterations $K = 2000$ and select those runs which converge to one of the wells (into its $\sigma$-vicinity) in no more than $K$ iterations. Then, we compute the fraction of the runs which converge to the wider well. For $T = 0$, all runs converge to the well with the smaller width. According to the Eyring formula (6), with increasing temperature $T$, we expect an increase in the fraction of SGLD runs from the starting point $x_0$ which converge to the well $c_1$ with the larger width. Figure 2 confirms this behavior.
The efficiency of SGLD depends on hyperparameters such as the temperature and the learning rate, which influence both the convergence behavior and the stability of the SGLD solution. Since our analysis is limited to the use of a stochastic differential equation, we simply use a sufficiently low learning rate value without a detailed analysis and tuning. The conditions on the learning rate are that it should be small enough to guarantee not escaping the target well, but not too small, since a very small learning rate would require a large number of iterations. The temperature is relevant for our consideration. An increase in the temperature affects the convergence and stability of the SGLD because, if the temperature is too high, fluctuations can throw the trajectory beyond the minimum. To demonstrate this, in Figure 3 we plot the fraction of trajectories that converge to one of the two extrema versus the number of iterations, for different temperatures and a single learning rate.

5.3. Predator–Prey Model

For the predator–prey model, we consider a more general dynamical system than the one defined by Equations (9) and (10). Let $x(t)$ be the position of the prey and $y(t)$ the position of the predator at time $t$. In the simulations, for visualization, we consider $x, y \in \mathbb{R}^2$. We consider their joint evolution as governed by the system of equations
$$\frac{dx}{dt} = \nabla L(x) + V(x(t) - y(t)), \qquad \frac{dy}{dt} = W(x(t) - y(t)),$$
where $V$ and $W$ are some vector functions (forces) describing the interaction between the prey and the predator (in this subsection, they are not the same as the functions $V$ and $W$ considered in the previous sections). Informally speaking, $x(t)$ tries to evolve so as to simultaneously maximize the objective function $L$ and the distance to the predator, while $y(t)$ tries to evolve so as to minimize the distance to the prey. For $W$, we choose
$$W(x - y) = \alpha_y \frac{x - y}{\|x - y\|},$$
which defines the motion of $y$ with a constant speed $\alpha_y$ towards $x$.
A key point in our analysis is to find suitable conditions on the vector function $V(x(t) - y(t))$, which depends on the difference $x(t) - y(t)$. Let $d = \|x(t) - y(t)\|$ be the distance between $x(t)$ and $y(t)$. We suggest that the potential should have the following general behavior at various distances:
  • At short distances, $d < \sigma_{\min}$, where $\sigma_{\min}$ is an estimate of the minimum acceptable width of a well (i.e., the width starting from which we do not consider the well to be narrow), we assume $\|V\| \gg 1$, to allow the predator to push the prey out of the well.
  • At intermediate distances, $\sigma_{\min} < d < \sigma_{\max}$, where $\sigma_{\max}$ is an upper estimate of the width of the wells of $L(x)$, we assume $\|V\| \sim \|W\|$. This condition is introduced to have oscillations in sufficiently wide wells.
  • At long distances, $d > \sigma_{\max}$, we assume $\|V\| \approx 0$, to guarantee convergence to some well of $L(x)$.
As an explicit potential satisfying these conditions, we take
$$V(x - y) = \left( \frac{A}{1 + e^{c(d - l)}} + C\, \frac{e^{-\sigma d}}{d} \right) \frac{x - y}{d}, \qquad d = \|x - y\|, \qquad (18)$$
with some parameters $A, C, \sigma, c, l > 0$. It contains the Yukawa potential $\frac{e^{-\sigma d}}{d}$ as a summand. The parameter $l$ can be interpreted as a characteristic intermediate distance, and $A$ as the magnitude of the interaction at intermediate (mid-range) distances, as shown in Figure 4. The norm of the interaction vector function $V$, along with its characteristic points for the parameter values described below, is plotted in Figure 4.
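The interaction forces in code (a sketch; the value of $\alpha_y$ is an assumption, and the sigmoid is rewritten via tanh to avoid overflow for large $c$, which is our implementation choice):

```python
import numpy as np

def W_force(x, y, alpha_y=0.2):
    """Predator drift: constant-speed motion of y towards x."""
    r = x - y
    return alpha_y * r / np.linalg.norm(r)

def V_force(x, y, A=0.3, l=1.0, c=1e3, C=10.0, sigma=10.0):
    """Prey repulsion, Equation (18): a smoothed step of height A acting up to
    distance ~l, plus the short-range Yukawa term C * exp(-sigma*d)/d,
    directed from the predator y towards the prey x."""
    r = x - y
    d = np.linalg.norm(r)
    # A / (1 + exp(c*(d - l))) rewritten as A * (1 - tanh(c*(d - l)/2)) / 2
    step = 0.5 * A * (1.0 - np.tanh(0.5 * c * (d - l)))
    yukawa = C * np.exp(-sigma * d) / d
    return (step + yukawa) * r / d
```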
We assume that, by tuning the parameters of the interaction vector function $V$, we can push the system out of narrow wells of $L$. To estimate these parameters, we consider the case of limiting oscillations around a radially symmetric well. Such oscillations are schematically represented in Figure 5. Note that this behavior is not stable for small values of $\theta$ due to the discreteness of the step, also known as the learning rate. Let $R_x$ and $R_y$ be the distances between the extremum and $x$ or $y$, respectively, and let $\theta$ be the angle between the predator–prey line and the radial line. In our simulations, the smallest, critical angle value is about $\theta \approx \pi/18$. For this behavior with limiting oscillations and a radially symmetric structure, we have the following relations:
$$\frac{R_x}{R_y} = \frac{\left\| \nabla_x L(x) + V(x - y) \right\|_{\|x\| = R_x,\, \|x - y\| = d}}{\alpha_y}, \qquad \left\| \nabla_x L(x) \right\|_{\|x\| = R_x} = \left\| V(x - y) \right\|_{\|x - y\| = d}, \qquad R_y^2 = R_x^2 + d^2 - 2 R_x d \cos\theta.$$
To enable pushing out of narrow wells and oscillations inside sufficiently wide wells, the constants α y , A , l , c , C , σ should satisfy the following heuristic conditions:
  • The constant A should be small enough not to generate a too strong pushing-out potential; otherwise, the prey would escape all the wells.
  • The constant A should satisfy A > α y , so that x can keep at some distance from y.
  • The constant l should correspond to a sufficient width of the well. If l is too large, the non-convergence to any well can occur because x will enter the regime of running away from y. If l is too small, the dynamics will not have oscillations in the wider well.
  • The constant $\sigma$ in Equation (18) determines the minimal width of a well: the width of a well from which pushout occurs due to the short-range Yukawa potential is of order $\sigma^{-1}$ (up to some constant).
  • The constant $C$ must be chosen large enough so that the Yukawa potential creates a repulsion stronger than the attraction of the gradient $\nabla L$ near the narrow well.
  • The parameter $c$ characterizes the rate of the transition from intermediate to long distances; its value is taken to be large enough. With increasing $c$, the norm $\|V(d)\|$ of the vector function $V(d)$ becomes more step-like, with a jump at $d = l$.
A typical evolution is shown in Figure 6 (Video S1, available in the Supplementary Materials). The model parameters for this simulation are the following. The centers of the two Gaussian extrema in (15) ($n = 2$) are $c_1 = (0, 0)$ and $c_2 = (7.0, 7.0)$, with widths $\sigma_1 = 0.5$, $\sigma_2 = 2.0$ and amplitudes $q_1 = 0.25$, $q_2 = 8.0$. The parameters of the function $V$ defined by Equation (18) are: $A = 0.3$, $l = 1.0$, $c = 10^3$, $C = 10.0$, $\sigma = 10.0$. The starting points for the prey and the predator are $(0.5, 0)$ and $(0, 2.0)$, respectively.
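A sketch of the integration loop with these parameters (our Euler discretization; the step size and iteration count are assumptions):

```python
import numpy as np

def simulate(grad_L, V_force, W_force, x0, y0, dt=1e-2, steps=20000):
    """Euler integration of the predator-prey system of Section 5.3:
    dx/dt = grad L(x) + V(x - y), dy/dt = W(x - y)."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    traj_x, traj_y = [x.copy()], [y.copy()]
    for _ in range(steps):
        x = x + dt * (grad_L(x) + V_force(x, y))
        y = y + dt * W_force(x, y)
        traj_x.append(x.copy())
        traj_y.append(y.copy())
    return np.array(traj_x), np.array(traj_y)

# With grad_L built from the two wells above and the starting points of the text:
# traj_x, traj_y = simulate(grad_L, V_force, W_force, x0=[0.5, 0.0], y0=[0.0, 2.0])
```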
We observe the following two regimes: pushing out of the narrower well and oscillations in the wider well. One of the parameters of the interaction vector function controls the transition between these two regimes; it has the meaning of the maximal width of a well from which pushout is expected. A third regime, in which escape from both wells occurs, is possible, but we were able to avoid this regime by adjusting the parameters.

5.4. Application to Wine Recognition Dataset

In the previous subsections, we applied the predator–prey model to synthetic settings to reveal the desired behavior. While the application of the method to real, large datasets is a separate complex task, here we investigate the improvement achieved by the suggested method on a small educational dataset. As an example, we consider the Wine recognition dataset from the scikit-learn library (https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset, accessed on 10 October 2024). This dataset consists of 178 instances with 13 numeric attributes. The target is the class attribute, encoded by the numbers 0, 1, 2. To make the overfitting process more visible, and for simplicity of obtaining overfitting, we treat this dataset as a regression task. Linear regression is a good solution for this dataset; however, to create an overfitting situation, we consider quadratic regression with ordinary gradient descent and compare it with the predator–prey model. We randomly split the dataset into train and test sets (the train set is 80%) and run gradient descent for linear and quadratic regression, and the predator–prey model for quadratic regression. The obtained results, summarized in Table 1, show that the predator–prey model improves the learning results. Runs with other random splittings of the dataset into train and test subsets show similar results in general. However, several differences may occur. First, linear regression sometimes shows better results than quadratic regression. Second, at the first iterations, gradient descent sometimes converges to some minimum faster than the predator–prey model, but then the GD loss goes up and GD starts overfitting, while the predator–prey model oscillates around the minimum, which avoids overfitting. Thus, in general, our finding in this example is that the predator–prey model works better, and sometimes much better, in reducing the effect of overfitting.
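A sketch of the gradient descent baseline of this experiment (our reconstruction with assumed hyperparameters; the predator–prey variant would add the forces $V$ and $W$ of Section 5.3 to the parameter updates, which we do not reproduce here):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Wine dataset treated as regression on the class label (0, 1, 2)
X, y = load_wine(return_X_y=True)
X = PolynomialFeatures(degree=2).fit_transform(StandardScaler().fit_transform(X))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Plain gradient descent on the MSE loss of quadratic regression
w = np.zeros(X_tr.shape[1])
lr = 1e-4
for _ in range(5000):
    w -= lr * 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)

pred = np.clip(np.round(X_te @ w), 0, 2)  # decode prediction to the nearest class
print("test MSE:", float(np.mean((X_te @ w - y_te) ** 2)),
      "accuracy:", float(np.mean(pred == y_te)))
```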

6. Conclusions

Various mimics of physical or biological behavior appear in machine learning, e.g., in evolutionary and genetic algorithms. In this work, we discuss a possible justification, based on models from physics and biology, of the ability to control overfitting in SGLD and the GAN. For SGLD, we show that the Eyring formula of kinetic theory allows one to control overfitting in the algorithmic stability approach, where wide minima of the risk function with low free energy correspond to low overfitting. We also establish a relation between the GAN and the predator–prey model in biology, which allows us to explain the selection of wide likelihood maxima and the overfitting reduction for the GAN (the predator pushes the prey out of narrow likelihood maxima). We performed numerical simulations and suggested conditions on the potentials which imply such behavior.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/e26121090/s1, Video S1: Simulation of the predator-prey model with two extrema of different widths.

Author Contributions

Conceptualization, S.V.K.; investigation, S.V.K., I.A.L. and A.N.P.; software, I.A.L.; writing—review and editing, S.V.K., I.A.L. and A.N.P.; visualization, I.A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Science and Higher Education of the Russian Federation, grant number 075-15-2024-529.

Data Availability Statement

All the input data used during this study and all the computed data during this study are written or shown in the figures in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GD: Gradient descent
SGD: Stochastic gradient descent
GAN: Generative adversarial network
SDE: Stochastic differential equation
SGLD: Stochastic gradient Langevin dynamics
lr: Learning rate

Appendix A

Here, we provide some relevant notions from the theory of random processes [44].
Fokker–Planck equation. Consider a diffusion with the generator
$$\hat{L} f(x) = \frac{1}{2} \sum_{i,j} a^{ij}(x) \frac{\partial^2 f(x)}{\partial x^i \partial x^j} + \sum_i b^i(x) \frac{\partial f(x)}{\partial x^i} \qquad (A1)$$
and the adjoint operator
$$\hat{L}^* f(x) = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x^i \partial x^j} \left( a^{ij}(x) f(x) \right) - \sum_i \frac{\partial}{\partial x^i} \left( b^i(x) f(x) \right).$$
Then, the transition probability density of the diffusion satisfies the Fokker–Planck equation
$$\frac{\partial p(t, x, y)}{\partial t} = \hat{L}^*_y p = \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial y^i \partial y^j} \left( a^{ij}(y)\, p(t, x, y) \right) - \sum_i \frac{\partial}{\partial y^i} \left( b^i(y)\, p(t, x, y) \right). \qquad (A2)$$
A stochastic differential equation
$$d\xi^i(t) = \sum_j \sigma^i_j(\xi(t))\, dw^j(t) + b^i(\xi(t))\, dt, \qquad (A3)$$
where $dw^j(t)$ is the stochastic differential of the Wiener process, defines a diffusion with a generator of the form (A1), where
$$a^{ij}(x) = \sum_k \sigma^i_k(x)\, \sigma^j_k(x),$$
and the coefficients $b^i$ in (A1) and (A2) coincide with the $b^i$ in (A3).
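As a consistency check (our addition), specializing (A3) to the SGLD Equation (4) with $\sigma^i_j = \sqrt{2\theta}\, \delta^i_j$ and $b^i = -\partial f/\partial x^i$ gives $a^{ij} = 2\theta\, \delta^{ij}$, and the Fokker–Planck Equation (A2) becomes
$$\frac{\partial u}{\partial t} = \frac{1}{2} \sum_i \frac{\partial^2}{\partial x^i \partial x^i} \left( 2\theta\, u \right) + \sum_i \frac{\partial}{\partial x^i} \left( u\, \frac{\partial f}{\partial x^i} \right) = \theta \Delta u + \nabla u \cdot \nabla f + u \Delta f,$$
which is exactly the diffusion Equation (5).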

References

1. Turing, A. Computing machinery and intelligence. Mind 1950, 59, 433–460.
2. von Neumann, J. The Computer and the Brain, 1st ed.; Yale University Press: New Haven, CT, USA, 1958.
3. Manin, Y.I. Complexity vs. energy: Theory of computation and theoretical physics. J. Phys. Conf. Ser. 2014, 532, 012018.
4. Goldberg, D.E. Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed.; Addison-Wesley: Boston, MA, USA, 1989.
5. Vikhar, P.A. Evolutionary algorithms: A critical review and its future prospects. In Proceedings of the International Conference on Global Trends in Signal Processing, Information Computing and Communication, Jalgaon, India, 22–24 December 2016.
6. Hopfield, J.J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 1982, 79, 2554–2558.
7. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9, 147–169.
8. Kozyrev, S.V. Lotka–Volterra Model with Mutations and Generative Adversarial Networks. Theor. Math. Phys. 2024, 218, 276–284.
9. Kozyrev, S.V. Transformers as a Physical Model in AI. Lobachevskii J. Math. 2024, 45, 710–717.
10. LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.A.; Huang, F.J. A Tutorial on Energy-Based Learning; MIT Press: Cambridge, MA, USA, 2006.
11. Chaudhari, P.; Choromanska, A.; Soatto, S.; LeCun, Y.; Baldassi, C.; Borgs, C.; Chayes, J.; Sagun, L.; Zecchina, R. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In Proceedings of the ICLR, Toulon, France, 24–26 April 2017.
12. Katsnelson, M.I.; Wolf, Y.I.; Koonin, E.V. Towards physical principles of biological evolution. Phys. Scr. 2018, 93, 043001.
13. Vanchurin, V.; Wolf, Y.I.; Katsnelson, M.I.; Koonin, E.V. Towards a theory of evolution as multilevel learning. Proc. Natl. Acad. Sci. USA 2022, 119, e2120037119.
14. Zhu, L.; Zhang, C.; Dong, L.; Rong, M.; Gong, J.; Meng, F. Inverse design of electromagnetically induced transparency (EIT) metasurface based on deep convolutional Generative Adversarial Network. Phys. Scr. 2023, 98, 105501.
15. Norambuena, A.; Mattheakis, M.; González, F.J.; Coto, R. Physics-informed neural networks for quantum control. Phys. Rev. Lett. 2024, 132, 010801.
16. Dong, D.; Chen, C.; Tarn, T.-J.; Pechen, A.; Rabitz, H. Incoherent Control of Quantum Systems With Wavefunction-Controllable Subspaces via Quantum Reinforcement Learning. IEEE Trans. Syst. Man. Cybern. B Cybern. 2008, 38, 957–962.
17. Dong, D.; Chen, C.; Li, H.; Tarn, T.-J. Quantum Reinforcement Learning. IEEE Trans. Syst. Man. Cybern. B Cybern. 2008, 38, 1207–1220.
18. Pechen, A.; Rabitz, H. Teaching the environment to control quantum systems. Phys. Rev. A 2006, 73, 062102.
19. Biamonte, J.; Wittek, P.; Pancotti, N.; Rebentrost, P.; Wiebe, N.; Lloyd, S. Quantum Machine Learning. Nature 2017, 549, 195–202.
20. Sieberer, L.M.; Buchhold, M.; Diehl, S. Keldysh field theory for driven open quantum systems. Rep. Prog. Phys. 2016, 79, 096001.
21. Nokkala, J.; Piilo, J.; Bianconi, G. Complex quantum networks: A topical review. J. Phys. A Math. Theor. 2024, 57, 233001.
22. Meng, X.; Zhang, J.-W.; Guo, H. Quantum Brownian motion model for the stock market. Physica A 2016, 452, 281–288.
23. Sun, W.; Chang, Y.; Wang, D.; Zhang, S.; Yan, L. Delegated quantum neural networks for encrypted data. Phys. Scr. 2024, 99, 05510.
24. Eyring, H. The Activated Complex in Chemical Reactions. J. Chem. Phys. 1935, 3, 107–115.
25. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the NIPS, Montréal, QC, Canada, 8–13 December 2014.
26. Morzhin, O.V.; Pechen, A.N. Control of the von Neumann Entropy for an Open Two-Qubit System Using Coherent and Incoherent Drives. Entropy 2023, 26, 36.
27. Qin, C.; Wu, Y.; Springenberg, J.T.; Brock, A.; Donahue, J.; Lillicrap, T.P.; Kohli, P. Training Generative Adversarial Networks by Solving Ordinary Differential Equations. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 6–12 December 2020.
28. Khrulkov, V.; Babenko, A.; Oseledets, I. Functional Space Analysis of Local GAN Convergence. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021.
29. Nagarajan, V.; Kolter, J.Z. Gradient descent GAN optimization is locally stable. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
30. Parisi, G. Correlation functions and computer simulations. Nucl. Phys. B 1981, 180, 378–384.
31. Parisi, G. Correlation functions and computer simulations II. Nucl. Phys. B 1982, 205, 337–344.
32. Geman, S.; Hwang, C.-R. Diffusions for Global Optimization. SIAM J. Control Optim. 1986, 24, 1031–1043.
33. Welling, M.; Teh, Y.W. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, DC, USA, 28 June–2 July 2011.
34. Kozyrev, S.V.; Volovich, I.V. The Arrhenius formula in kinetic theory and Witten's spectral asymptotics. J. Phys. A 2011, 44, 215202.
35. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752.
36. Chang, Z.; Koulieris, G.A.; Shum, H.P.H. On the Design Fundamentals of Diffusion Models: A Survey. arXiv 2023, arXiv:2306.04542.
37. Bousquet, O.; Elisseeff, A. Stability and Generalization. J. Mach. Learn. Res. 2002, 2, 499–526.
38. Kutin, S.; Niyogi, P. Almost-everywhere algorithmic stability and generalization error. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Edmonton, AB, Canada, 1–4 August 2002.
39. Poggio, T.; Rifkin, R.; Mukherjee, S.; Niyogi, P. General conditions for predictivity in learning theory. Nature 2004, 428, 419–422.
40. Hochreiter, S.; Schmidhuber, J. Flat Minima. Neural Comput. 1997, 9, 1–42.
41. Avelin, B.; Julin, V.; Viitasaari, L. Geometric Characterization of the Eyring–Kramers Formula. Commun. Math. Phys. 2023, 404, 401–437.
42. Sevastyanov, B.A. Branching Processes, 1st ed.; Nauka: Moscow, Russia, 1971.
43. Haccou, P.; Jagers, P.; Vatutin, V.A. Branching Processes: Variation, Growth, and Extinction of Populations; Cambridge University Press: Cambridge, UK, 2005.
44. Wentzell, A.D. A Course in the Theory of Stochastic Processes; McGraw-Hill International: New York, NY, USA, 1981.
Figure 1. Thermal plot of the function $L$ (left) and its gradient field (right). The red dot in the center of the gradient field plot shows the starting point $(0, 0)$.
Figure 2. Fraction of the runs of the SGLD starting at the point $x_0$ which converge to the well $c_1$ with the greater width.
Figure 3. Fraction of the points which converge to the extrema vs. the iteration number, plotted for several inverse temperatures $\beta = 0, 0.75, 1.5, 2.25, 3.0$.
Figure 4. The norm $\|V(d)\|$ of the vector function $V(d)$ and its characteristic points: $l$ and $A$.
Figure 5. The limiting unstable oscillations around the extremum point for the predator–prey model.
Figure 6. Simulation of the GAN process for the potentials $V$ and $W$ with two extrema of $L(x)$ having different widths (Video S1, available in the Supplementary Materials). The red dot shows the prey and the blue dot the predator. We observe two regimes: first, the prey escapes from the narrow well and moves to the wider well (first regime), where it starts to oscillate (second regime).
Table 1. Results of linear and quadratic regression with gradient descent, and of the predator–prey model, on the Wine recognition dataset. MSE loss is the mean squared error; accuracy is the percentage of correctly predicted classes.
Model                 MSE Loss    Accuracy
linear GD             0.12        81%
quadratic GD          0.12        89%
quadratic PP-model    0.07        94%