1. Introduction
Analogies from physics and other fields, particularly population genetics, are of interest when studying problems in machine learning theory. Analogies between machine learning theory and Darwinian evolution theory were discussed already by Alan Turing [1]. Biological analogies in computing were discussed by John von Neumann [2], and physical models in relation to computing by Yuri Manin [3]. Such analogies allow physical intuition to be used in learning theory. Well-known examples include genetic [4] and evolutionary algorithms [5], models of neural networks as physical systems with emergent collective computational abilities and content-addressable memory [6], and a parallel search learning method based on statistical mechanics and Boltzmann machines that mimic Ising spin chains [7]. A phenomenological model of population genetics, the Lotka–Volterra model with mutations, related to the generative adversarial network (GAN), was introduced in [8]. Analogies between the evolution operator in physics and transformers (an artificial intelligence model) were discussed in [9]. Ideas of thermodynamics applied to learning were considered in [10,11], and in relation to evolution theory in [12,13]. GANs have found many applications in physics, e.g., for designing electromagnetically induced transparency metasurfaces [14]. Physics-informed neural networks were developed for solving the optimal quantum control of open quantum systems [15]. Beyond that, the theory of open systems inspired the development of quantum reinforcement learning, where the states of the agent are quantum states of some quantum system and the dynamic environment is a quantum environment surrounding the manipulated system to optimize the reward [16,17], which is connected to the general incoherent quantum control paradigm [18]. Various approaches to quantum machine learning have been proposed, e.g., [19]. Open quantum systems have been suggested for application to spin glasses [20], network science [21], and finance [22]. Quantum neural networks have been proposed for problems such as working with encrypted data to protect the privacy of users' data and models [23].
Within the framework of such analogies, it is natural to discuss biological analogs of various models of machine learning theory. In this work, two analogies are considered: between the stochastic gradient Langevin dynamics (SGLD) in machine learning and the Eyring formula in kinetic theory [24], and between the GAN model [25] and the mathematical predator–prey model in biology, where we suggest considering the discriminator and generator in the GAN as playing the roles of prey and predator, respectively. The proposed analogies allow us to explain the efficiency of controlling overfitting, i.e., the lack of generalization ability of a machine learning approach. It is known that stochastic gradient descent (SGD) reduces overfitting; for the GAN, this effect of reducing overfitting is even more significant. We propose to explain overfitting control in these processes within the framework of the algorithmic stability approach by the suppression of narrow minima of the empirical risk function.
In Section 2, we consider the stochastic gradient Langevin dynamics (SGLD) using a stochastic differential equation (SDE). We show that the reduction of overfitting in such a model follows from ideas used in chemical kinetics, such as the Eyring formula, which states that the reaction rate (the rate of the transition between two potential wells due to diffusion) is determined by the free energy of the transition state (the saddle between the two potential wells) and the free energy of the initial state of the reaction (optimization of quantities involving the entropy-dependent Helmholtz free energy also appears in quantum optimization, e.g., [26]).
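The qualitative mechanism behind this selection of wide wells can be sketched numerically. The following toy illustration (not from the original text) assumes the well width enters the initial-state free energy through an entropy term proportional to log(width), so that at a fixed barrier height, narrower wells are escaped faster:

```python
import math

def escape_rate(delta_e, width, temperature):
    """Eyring-type rate exp(-(F_saddle - F_well)/T), where the well's free
    energy F_well = E_well - T*log(width) contains an entropy term
    ~ log(width). The result, exp(-delta_e/T) / width, shows that
    narrower wells are escaped faster at the same barrier height."""
    return math.exp(-delta_e / temperature) / width
```

Under this assumption, a well ten times narrower is escaped roughly ten times faster, which is the effect exploited below to suppress narrow minima.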
In Section 3, we describe the minimax problem for the GAN model by a system of two stochastic differential equations (one for the discriminator and another for the generator). In this sense, the GAN model is a two-body generalization of the SGLD model considered in Section 2. However, this generalization significantly changes the behavior of the learning system, as demonstrated by the simulations below. We show that this model implements a selection of wide maxima of the likelihood function, leading to a reduction of overfitting. A biological interpretation of the GAN model is provided in terms of the interaction between a predator and a prey (the predator is the generator; the prey is the discriminator). Learning for GANs by solving a system of ordinary differential equations was considered in [27,28,29].
In Section 4, we introduce a generalization of the GAN to a population genetics type of model. We consider a branching random process with diffusion for two types of particles (discriminators and generators), in which discriminators and generators can replicate. In this case, the rates of replication and death of particles depend on their contributions to the minimax functional defined by the GAN model.
In Section 5, we provide numerical simulations illustrating the behavior of stochastic gradient descent and of the predator–prey model for the GAN minimax problem, for a simple potential with two wells. We observe the following two regimes: pushing out of the narrower well and oscillations in the wider well. One parameter of the interaction allows one to control the transition between these two regimes. A third regime, in which escape from both wells occurs, is possible; however, this regime can be avoided by adjusting the parameters. Some relevant notions from the theory of random processes are provided in Appendix A. Section 6 summarizes the results.
3. The GAN Model and Overfitting
3.1. Stochastic Gradient Langevin Dynamic for GAN
The generative adversarial network (GAN) model is a minimax problem [25]: one looks for the minimax (7), min over y of max over x of V(x, y), of a functional V given by (8) and built from a sample {z_i} in Z, where p(z|x) and q(z|y) are parametric families of probability distributions on the space Z, called the discriminator and the generator, with parameters x and y from statistical manifolds X and Y (which we will assume to be real vector spaces).
In [25], the generator was considered as a parametric family of mappings from some auxiliary space to Z; these mappings transfer the probability distribution on the auxiliary space to Z. The discriminator was described by a distribution on the same space as the data, with the following interpretation: given the data and the generator, the discriminator outputs the binary variable "one" or "zero" according to this distribution (i.e., it outputs "one" if the discriminator considers these inputs correct).
The first contribution to V in (8) is the log-likelihood function. The second contribution behaves qualitatively as minus the inverse of the Kullback–Leibler distance (Kullback–Leibler divergence, KL distance, or KLD) between the distributions of the discriminator and the generator (that is, this contribution is negative, large in magnitude for small KL distances, and tends to zero for large KL distances), as was mentioned in [25].
The Kullback–Leibler distance between probability distributions p and q is defined as
\[
D_{\mathrm{KL}}(p\,\|\,q) = \int_Z p(z)\,\log\frac{p(z)}{q(z)}\,dz.
\]
The minimax of V over x, y is obtained from the local maximum of V over x. Transitions between the local maxima of V generate transitions between local minimaxes of V (the generator follows the discriminator), as shown below.
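For discrete distributions, the KL distance can be computed directly; the following minimal sketch (illustrative code, with distributions given as probability lists) shows the computation:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.5]
q = [0.9, 0.1]
```

Note that D_KL is not symmetric in its arguments, so "distance" is used loosely here, as in the text above.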
The stochastic gradient Langevin dynamics optimization for the problem (7), (8) can be described by a system of SDEs defining random walks x(t), y(t) on the statistical manifolds X and Y of the discriminator and the generator,
\[
dx(t) = \partial_x V\big(x(t), y(t)\big)\,dt + \sqrt{2T}\,dW_x(t),
\qquad
dy(t) = -\partial_y V\big(x(t), y(t)\big)\,dt + \sqrt{2T}\,dW_y(t).
\]
Here, \(\partial_x V\) and \(\partial_y V\) are the derivatives of V with respect to the first and second arguments, respectively; \(W_x\) and \(W_y\) are Wiener processes on the parameter spaces of the discriminator and the generator, and T is a temperature parameter. This system of stochastic equations follows directly from the minimax condition and is independent of the KL distance formulation; however, the KLD formulation is used later for analyzing the behavior of the solution. In this system of SDEs, the discriminator seeks to maximize the functional (8) with respect to x, and the generator seeks to minimize this functional with respect to y.
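A minimal Euler–Maruyama discretization of this ascent/descent pair can be sketched as follows (an illustrative sketch assuming one-dimensional parameters and a hypothetical bilinear functional V(x, y) = x·y with a saddle at the origin; the step size and temperature are illustrative values):

```python
import math
import random

def sgld_gan_step(x, y, dVdx, dVdy, dt, temperature, rng):
    """One Euler-Maruyama step of the coupled SDEs: the discriminator
    ascends V in x, the generator descends V in y, both with noise."""
    sigma = math.sqrt(2.0 * temperature * dt)
    x_new = x + dVdx(x, y) * dt + sigma * rng.gauss(0.0, 1.0)
    y_new = y - dVdy(x, y) * dt + sigma * rng.gauss(0.0, 1.0)
    return x_new, y_new

# Hypothetical bilinear functional V(x, y) = x*y, saddle at the origin.
rng = random.Random(0)
x, y = 1.0, 0.0
for _ in range(1000):
    x, y = sgld_gan_step(x, y, lambda a, b: b, lambda a, b: a, 0.01, 0.0, rng)
```

In the noiseless case (temperature 0), the trajectory circles the saddle instead of converging to it, which is the oscillatory behavior discussed in the example below.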
Example 1. Let us consider one-dimensional parameters x and y for the discriminator and the generator, respectively, and the functional V(x, y) = xy with minimax located at the origin. The noiseless GAN equation system is
\[
\dot{x}(t) = y(t), \qquad \dot{y}(t) = -x(t).
\]
Its solution is
\[
x(t) = x_0\cos t + y_0\sin t, \qquad y(t) = y_0\cos t - x_0\sin t,
\]
with oscillations around the minimax. In [27,28,29], the convergence of the optimization of the GAN model by the gradient descent method with respect to the parameters of the discriminator and generator in a neighborhood of the functional's local minimax was studied, and oscillations of the parameters were discussed.
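The oscillatory behavior can be checked numerically; the sketch below assumes the noiseless system is the linear pair x' = y, y' = -x, whose solution is a rotation:

```python
import math

def gan_orbit(x0, y0, t):
    """Closed-form solution of x' = y, y' = -x: a rotation of (x0, y0)."""
    return (x0 * math.cos(t) + y0 * math.sin(t),
            y0 * math.cos(t) - x0 * math.sin(t))

xt, yt = gan_orbit(1.0, 0.0, math.pi / 2)
```

The squared radius x(t)^2 + y(t)^2 is conserved along the orbit, so the parameters oscillate around the minimax indefinitely rather than converging to it.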
3.2. Overfitting Control for GAN
If we ignore the presence of the generator, then the dynamics of the discriminator (9) for optimization with noise corresponds to diffusion in the potential generated by the data. Thus, the arguments of Section 2 are applicable, and overfitting can be reduced according to the Eyring formula.
The presence of the generator further suppresses overfitting. The minimax problem for the GAN (7) can be described as follows. The discriminator (9) tries to reach regions of the parameter x with high values of V. The generator (10) tries to reach regions of the parameter y with low values of V. The contribution to V in (8) from the likelihood function depends only on the parameters of the discriminator; i.e., the discriminator tries to increase both contributions to (8), while the generator tries to decrease only the second contribution. The second contribution to (8) decreases at small Kullback–Leibler distances between the discriminator and the generator.
Therefore, the compromise between the optimization problems for the discriminator and the generator will be achieved when they are located at maxima of the contribution from the likelihood function to (8) which are sufficiently wide in the space of the parameters (where the average KL–distance between the discriminator and the generator is not too small). Selecting wide maxima, in accordance with the algorithmic stability approach, will reduce the effect of overfitting.
Here, we propose a biological prey–predator interpretation of the GAN model, which is completely different from the interpretation used in [25]. In our interpretation, the discriminator is a herbivore (prey), the generator is a predator, and the data are grass. The minimax problem (7) then describes the situation where the discriminator searches for grass (the maximum of the likelihood function; this corresponds to an increase in the first contribution to (8)) and also runs away from the predator (an increase in the second contribution to (8)), while the predator chases the prey (and hence decreases this contribution). In our interpretation, the discriminator (herbivore), as a distribution, tries to get closer to the data distribution (grass) and farther from the generator (predator) as a distribution (in the sense of the KL distance), while the generator tries to be closer to the discriminator. As a result of this interaction, the generator also moves toward the data (grass), because herbivores (discriminators) are likely to be found there; however, this does not mean that the predator tries to imitate grass (that would be a mixture of the two interpretations of the GAN). Minimization in (7) for the generator forces the predator to move to fields (or meadows, likelihood maxima) where the discriminator is present. The interplay of the two contributions to (8) forces the discriminator to search for sufficiently wide meadows (likelihood maxima) where the average KL distance from the predator is not too small. In general, the predator pushes the prey out of narrow fields of grass, and both the prey and the predator move to wide grass fields. Thus, the GAN model implements the selection of wide likelihood maxima, which reduces overfitting.
Simulations illustrating the behavior discussed above are presented in Section 5 below.
4. Branching Random Process for GAN
In this section, a branching random process with diffusion and particle interactions describing the populations of discriminators and generators in a generalization of the GAN model is introduced.
The theory of branching random processes and its connection with population genetics have been actively discussed in the literature, for example, in [42,43]. Previously, in [8], a generalization of the GAN model related to population genetics (a Lotka–Volterra-type model with mutations) was discussed. In that model, discriminators and generators could reproduce and form populations; the phenomenological equations of population dynamics were considered, and the suppression of overfitting was discussed.
Consider a generalization of the GAN to the case of several discriminators (particles in the hypothesis space of the discriminator with parameter x, indexed by a) and several generators (particles in the hypothesis space of the generator with parameter y, indexed by b). The analog of the SDE system (9) and (10) takes the form of Equations (11) and (12), where each particle is associated with its own independent Wiener process on the right-hand side of the equation, in the discriminator and generator spaces, respectively, and the right-hand sides of the equations contain interaction terms between discriminators and generators. The first contribution is the likelihood function for the discriminator x. This corresponds to a GAN-type model with the functional (13) for the discriminator x and the functional (14) for the generator y. Equations (11) and (12) describe optimization by the stochastic gradient Langevin dynamics. Each discriminator interacts with a set of generators and, similarly, each generator interacts with a set of discriminators. The second contribution to (13) and the functional (14) contain sums of contributions that behave qualitatively as minus the inverse of the Kullback–Leibler distance between discriminators and generators with parameters x and y.
Let us define a model which mimics population genetics, given by a branching random process with diffusion and interaction for particles of two types (discriminators and generators), which perform random walks in accordance with Equations (11) and (12) and have the ability to replicate and die, with the probabilities of these processes depending on the functionals (13) and (14). The replication of a particle consists of replacing it with two particles of the same type with the same coordinates (which then continue to perform random walks in accordance with (11) and (12)).
We propose to use the following branching rates (related to the Lotka–Volterra-type model discussed in [8]): the death rate of generators is considered fixed, while the replication rate of a generator is proportional to minus its functional (14) (recall that both W and this interaction contribution are negative); the replication rate of a discriminator is proportional to its likelihood contribution to (13); and the death rate of a discriminator is proportional to minus the interaction contribution to (13). Thus, discriminators replicate depending on the data and die depending on the generators, while generators replicate depending on the discriminators and die at a constant rate.
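The replication/death mechanism can be sketched as a per-particle update over a small time step (an illustrative scheme with generic rates passed in as numbers; in the model above, these rates would be the functional-dependent ones just described):

```python
import random

def branching_step(rate_birth, rate_death, dt, rng):
    """Fate of one particle over a small time step dt: returns the number
    of copies after the step (2 = replicated, 0 = died, 1 = unchanged)."""
    u = rng.random()
    if u < rate_birth * dt:
        return 2
    if u < (rate_birth + rate_death) * dt:
        return 0
    return 1

rng = random.Random(0)
```

Between such events, every surviving particle continues its random walk in accordance with Equations (11) and (12).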
The biological interpretation of the proposed model is the following. The data are the distribution of grass, the discriminators are herbivores (prey), and the generators are predators; herbivores reproduce on grass, and predators hunt herbivores. The effect of suppressing overfitting at narrow likelihood maxima looks as follows: if the discriminator has replicated at a likelihood maximum (on its statistical manifold X), the generator will tend to go there and replicate there (the generator will tend to the corresponding regions of its statistical manifold Y, where the KL distance between p(·|x) and q(·|y) is small). In this case, for a narrow likelihood maximum, the average KL distance will be small; i.e., the predator will eat the prey more effectively (and then suffer from hunger) than for a wide maximum. This is how the effect of selective suppression of narrow population maxima in X and Y corresponding to narrow likelihood maxima is realized. For the population genetics model of the GAN (where discriminators and generators can replicate), the effect of overfitting control is more pronounced than for the standard GAN model (without replication).
5. Simulations
In this section, the results of the numerical simulation of the SGLD procedure and the simulation of the predator–prey model for the GAN are provided.
5.1. Objective Function
Let f(x) be an objective function for optimization. We consider f as a sum of non-normalized Gaussians of the following form (Equation (15)):
\[
f(x) = \sum_{i=1}^{n} A_i \exp\!\left(-\frac{\|x-\mu_i\|^2}{2\sigma_i^2}\right),
\]
where the amplitudes \(A_i\) and widths \(\sigma_i\) are some real-valued positive constants and the \(\mu_i\) are the centers of the extrema. For visualization, we use the two-dimensional case. The constant \(\sigma_i\) can be interpreted as a characteristic width of the extremum. We consider the case when the centers are far apart compared with the widths, so that f has n separated extrema. In the simulations below we take n = 2, so this objective function has two hills around the two extrema (or wells for the minimization of -f; for definiteness, we call them wells below).
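Equation (15) can be sketched directly; in the following minimal implementation, the parameter values of the example are illustrative, not those used in the simulations:

```python
import math

def objective(x, centers, widths, amps):
    """Sum of non-normalized Gaussians with centers mu_i, widths sigma_i,
    and amplitudes A_i, as in Equation (15)."""
    return sum(
        a * math.exp(-sum((xi - mi) ** 2 for xi, mi in zip(x, m))
                     / (2.0 * s * s))
        for m, s, a in zip(centers, widths, amps))

centers = [(0.0, 0.0), (5.0, 5.0)]
widths = (0.5, 1.0)   # the first well is narrower
amps = (1.0, 0.8)
```

With well-separated centers, the value at each center is essentially the corresponding amplitude, and the function is close to zero far from all centers.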
5.2. Stochastic Gradient Langevin Dynamics
For the objective function (15), we consider n = 2 wells with centers, widths, and amplitudes chosen so that the two wells have different widths (here and in similar places below, the superscript T denotes the transpose of a vector, not the temperature). A thermal plot of the function f and its gradient field are shown in Figure 1. As the starting point, we set a point \(x_0\) with positive coordinates. Therefore, the standard gradient descent procedure starting from the point \(x_0\) will converge to the well with the smaller width. Consider the standard stochastic gradient procedure
\[
x_{n+1} = x_n + \gamma\,\nabla f(x_n) + \sqrt{\gamma}\,\eta_n,
\]
where the \(\eta_n\) are two-dimensional independent random variables drawn from the normal distribution with center at the origin and variance 2T, T is the temperature parameter, and \(\gamma\) is a fixed learning rate. The scaling \(\sqrt{\gamma}\) of the noise is introduced to ensure convergence of the SGLD process.
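The discrete SGLD update can be sketched as follows (a minimal implementation; note the square-root-of-learning-rate scaling of the noise mentioned above, with noise variance 2T per step; the test potential is an illustrative quadratic hill, not the two-well objective):

```python
import math
import random

def sgld_step(x, grad, lr, temperature, rng):
    """One SGLD update: x -> x + lr*grad(x) + sqrt(lr)*eta, eta ~ N(0, 2T)."""
    sigma = math.sqrt(2.0 * temperature)
    return [xi + lr * gi + math.sqrt(lr) * rng.gauss(0.0, sigma)
            for xi, gi in zip(x, grad(x))]

# Illustrative run: ascend the quadratic hill f(x) = -|x|^2 at T = 0.
rng = random.Random(0)
x = [1.0, 1.0]
for _ in range(200):
    x = sgld_step(x, lambda p: [-2.0 * v for v in p], 0.1, 0.0, rng)
```

At T = 0, this reduces to plain gradient ascent; at T > 0, the noise lets trajectories cross saddles at the Eyring-type rates discussed in Section 2.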
We consider 15 values of the temperature T uniformly spanning an interval. For each value of T, we run the SGLD starting from the same initial point \(x_0\) with a maximal number of iterations K and select those runs which, in no more than K iterations, converge to one or the other well (into a small vicinity of its center). Then, we compute the fraction of the runs which converge to the wider well. For the smallest temperatures, all runs converge to the well with the smaller width. According to the Eyring formula (6), we expect the fraction of SGLD runs from the starting point \(x_0\) that converge to the well with the larger width to increase with the temperature T. Figure 2 confirms this behavior.
The efficiency of SGLD depends on hyperparameters such as the temperature and the learning rate, which influence both the convergence behavior and the stability of the SGLD solution. Since our analysis is limited to the use of a stochastic differential equation, we simply use a sufficiently low learning rate without a detailed analysis and tuning of its value. The conditions on the learning rate are that it should be small enough to guarantee not escaping the target well, but not too small, since a very small learning rate requires a large number of iterations. The temperature is relevant for our consideration: an increase in temperature affects the convergence and stability of the SGLD because, if the temperature is too high, fluctuations can throw the trajectory beyond the minimum. To demonstrate this, we plot in Figure 3 the dependence of the fraction of trajectories that converge to one of the two extrema on the number of iterations, for different temperatures and a single learning rate.
5.3. Predator–Prey Model
For the predator–prey model, we consider a more general dynamical system than the one defined by Equations (9) and (10). Let x(t) be the position of the prey and y(t) the position of the predator at time t; in the simulations, for visualization, we consider the two-dimensional case. We consider their joint evolution as governed by the system of equations
\[
\dot{x}(t) = \nabla f\big(x(t)\big) + V\big(x(t) - y(t)\big),
\qquad
\dot{y}(t) = W\big(x(t) - y(t)\big),
\]
where V and W are some vector functions (forces) describing the interaction between the prey and the predator (in this subsection, they are not the same as the functions V and W considered in the previous sections). Informally speaking, x tries to evolve so as to simultaneously maximize the objective function f and the distance to the predator, while y tries to evolve so as to minimize the distance to the prey. For W, we choose
\[
W(u) = \kappa\,\frac{u}{\|u\|},
\]
which defines the motion of y with a constant speed \(\kappa\) towards x.
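The constant-speed motion of y towards x can be sketched as a single discrete step (a minimal illustrative implementation; all names are illustrative):

```python
import math

def predator_step(y, x, speed, dt):
    """Move the predator y towards the prey x with constant speed."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    r = math.sqrt(sum(d * d for d in diff))
    if r == 0.0:
        return list(y)
    return [yi + speed * d / r * dt for yi, d in zip(y, diff)]

y_new = predator_step((0.0, 0.0), (3.0, 4.0), 1.0, 0.5)
```

Each step shortens the predator–prey distance by exactly speed times dt (until the predator reaches the prey), independently of how far away the prey is.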
A key point in our analysis is to find suitable conditions on the vector function V, which depends on the difference x - y. Let r = ||x - y|| be the distance between x and y. We suggest the potential to have the following general behavior at various distances:
At short distances, r < d, where d is an estimate of the minimum acceptable width of a well (i.e., a width starting from which we do not consider the well to be narrow). For these distances, we assume the force V to dominate the gradient of f, to allow the predator to push the prey out of the well.
At intermediate distances, d < r < D, where D is an estimate from above of the width of the wells of f. For these distances, we assume the force V to be comparable with the gradient of f. This condition is introduced to have oscillations in sufficiently wide wells.
At long distances, r > D, we assume the force V to be small compared with the gradient of f, to guarantee convergence to some well of f.
As an explicit potential which satisfies these conditions, we take the vector function V given by Equation (18), with some parameters A, l, C, c, d. It contains the Yukawa potential \(A\,e^{-r/l}/r\) as a summand. The parameter l can be interpreted as a characteristic intermediate distance and A as the amplitude of the mid-range interaction, as shown in Figure 4. The norm of the interaction vector function V, along with its characteristic points for the parameter values described below, is plotted in Figure 4.
We assume that by tuning the parameters of the interaction vector function V, we can push the system out of narrow wells of f. To estimate these parameters, we consider the case of limiting oscillations around a radially symmetric well. Such oscillations are represented schematically in Figure 5. Note that this behavior is not stable for small oscillation radii due to the discreteness of the step, also known as the learning rate. Let \(r_x\) and \(r_y\) be the distances between the extremum and x or y, respectively, and let \(\alpha\) be the angle between the predator–prey line and the radial line; in our simulations, a smallest, critical value of this angle is observed. For this behavior, with limiting oscillations and a radially symmetric structure, we obtain a relation between these quantities, which is used below to constrain the parameters.
To enable pushing out of narrow wells and oscillations inside sufficiently wide wells, the constants should satisfy the following heuristic conditions:
The constant A should be small enough not to generate a too-strong pushing-out potential; otherwise, the prey would escape all the wells.
The constant A should be large enough (in accordance with the relation above) so that x can keep at some distance from y.
The constant l should correspond to a sufficient width of a well. If l is too large, non-convergence to any well can occur, because x enters the regime of running away from y. If l is too small, the dynamics will not have oscillations in the wider well.
The constant d in Equation (18) determines the minimal width of a well. The width of the well from which pushout occurs due to the short-range Yukawa potential is determined by this constant, up to a numerical factor.
The constant C must be chosen large enough so that the Yukawa potential creates a repulsion stronger than the attraction of the gradient near the narrow well.
The parameter c characterizes the rate of the transition from intermediate to long distances; its value is taken large enough. With increasing c, the norm of the vector function becomes more step-like, with a jump near the corresponding characteristic distance.
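The qualitative shape of the interaction norm can be sketched as follows. This is not the paper's Equation (18), whose exact form is not reproduced here; the sketch only combines a Yukawa summand with a smoothed short-range push to reproduce the short/intermediate/long-distance behavior listed above, and all parameter values are illustrative:

```python
import math

def interaction_norm(r, A, l, C, c, d):
    """Illustrative stand-in for the norm of the prey-predator force:
    a Yukawa term A*exp(-r/l)/r plus a short-range logistic push of
    height C that switches off around the distance d."""
    yukawa = A * math.exp(-r / l) / r
    push = C / (1.0 + math.exp(c * (r - d)))
    return yukawa + push

params = dict(A=1.0, l=1.0, C=5.0, c=10.0, d=0.2)
```

With such a profile, the force dominates at short distances (pushout from narrow wells), stays moderate at intermediate distances (oscillations), and decays at long distances (convergence to some well).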
A typical evolution is shown in Figure 6 (Video S1, available in the Supplementary Material). The model parameters for this simulation are the following: the centers, widths, and amplitudes of the two Gaussian extrema in (15); the parameters of the function V defined by Equation (18); and the starting points for the prey and for the predator.
We observe the possibility of the following two regimes: pushing out of the narrower well and oscillations in the wider well. One of the parameters of the interaction vector function controls the transition between these two regimes; it has the meaning of the maximal width of a well from which pushout is expected. A third regime, in which escape from both wells occurs, is possible, but we were able to avoid this regime by adjusting the parameters.
5.4. Application to Wine Recognition Dataset
In the previous subsections, we applied the predator–prey model to synthetic problems to reveal the desired behavior. While the application of the method to real, large datasets is a separate complex task, here we investigate the improvement achieved by the suggested method on a small educational dataset. As an example, we consider the Wine recognition dataset from the scikit-learn library (https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-recognition-dataset, accessed on 10 October 2024). This dataset consists of 178 instances with 13 numeric attributes; the target is the class attribute, encoded by numbers. To make the overfitting process more visible, and for simplicity, we consider this dataset as a regression task. Linear regression is a good solution for this dataset; however, to create an overfitting situation, we consider quadratic regression with ordinary gradient descent and compare it with the predator–prey model. We randomly split the dataset into train and test sets and run gradient descent for linear and quadratic regression, as well as the predator–prey model for quadratic regression. The obtained results, summarized in Table 1, show that the predator–prey model allows one to improve the learning results. Runs with other random splittings of the dataset into train and test subsets show similar results in general; however, several differences may occur. First, linear regression sometimes shows better results than quadratic regression. Second, in the first iterations, gradient descent sometimes converges to some minimum faster than the predator–prey model, but then GD goes up and starts overfitting, while the predator–prey model oscillates around the minimum, which avoids overfitting. Thus, in general, our finding in this example is that the predator–prey model works better, and sometimes much better, at reducing the effect of overfitting.
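The gradient-descent baseline of this experiment can be sketched on synthetic stand-in data (illustrative code only: it does not load the actual Wine dataset, and it fits a one-feature quadratic model y ≈ a + b·x + c·x², trained by plain gradient descent on the mean-squared error):

```python
import random

def gd_quadratic_regression(xs, ys, lr=0.01, steps=5000):
    """Fit y ~ a + b*x + c*x^2 by gradient descent on the MSE loss."""
    a = b = c = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for x, y in zip(xs, ys):
            e = a + b * x + c * x * x - y
            ga += 2.0 * e / n
            gb += 2.0 * e * x / n
            gc += 2.0 * e * x * x / n
        a, b, c = a - lr * ga, b - lr * gb, c - lr * gc
    return a, b, c

# Synthetic stand-in data with a linear ground truth y = 1 + 2x + noise,
# so the quadratic term c is free to overfit the noise.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
a, b, c = gd_quadratic_regression(xs, ys)
```

In the predator–prey variant of this experiment, the parameter vector (a, b, c) would play the role of the prey in the dynamics of Section 5.3, with a predator pushing it out of narrow minima of the empirical loss.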