1. Introduction
Dualities provide a highly useful technique for solving complex problems that have found applications in many branches of science, most notably in physics. For example, well-known dualities include electric–magnetic duality [
1], wave–particle duality [
2], target–space dualities [
3], Kramers–Wannier duality [
4,
5], etc. More recent, but less well-known examples include a quantum–classical duality [
6], dual path integral [
7], and pseudo-forest duality [
8], to name a few. The key idea of all physical dualities is to establish a mapping (i.e., a duality mapping) between two systems (e.g., physical theories), which can then be used to study properties (e.g., obtaining solutions) of one system by analyzing the other system, or vice versa.
Perhaps the most well-studied example of a physical duality is the so-called bulk-boundary or holographic duality [
9,
10], such as AdS–CFT [
11,
12,
13,
14]. In AdS–CFT, the mapping is established between the bulk, representing a gravitational theory on the anti-de Sitter (AdS) background, and the boundary, representing a conformal field theory (CFT) without gravity.
Mathematical dualities focus on more formal, abstract transformations preserving algebraic or geometric structures, but are also very useful in physics. For instance, dual vector space duality [
15], duality of points and lines in projective geometry [
16], Hom and Tensor Duality [
17], etc.
In this article, we shall consider a learning duality, which we shall refer to as the dataset-learning duality. It is often convenient to view the dataset as representing the states of non-trainable variables on the ‘boundary’, and learning as representing the dynamics (or changes in states) of trainable variables in the ‘bulk’. Thus, the dataset-learning duality may be considered to be an example of a bulk-boundary duality in the context of learning theory. As we shall argue, the duality is a complex nonlinear mapping between very high-dimensional spaces, but near the equilibrium state, the analysis is greatly simplified. This simplification allows us to apply the dataset-learning duality to study the emergence of criticality in learning systems.
Criticality in physical systems refers to the behavior of these systems at or near critical points, where they undergo phase transitions. These critical points are characterized by dramatic changes in physical properties due to the collective behavior of the system’s components. Understanding phase transitions and criticality is essential for explaining many physical phenomena [
18] as well as complex biological phenomena such as biological phase transitions [
19]. Recent studies have shown that many physical [
20,
21,
22] and biological [
23,
24] systems can be modeled as learning systems, such as artificial neural networks. Therefore, understanding the criticality and phase transitions in artificial neural networks may also shed light on the emergence of criticality in physical and biological systems.
The phenomenon of self-organized criticality was investigated using biologically inspired discrete network models, which adapt their topology based on local rules without any central control to achieve a balance between stability and flexibility. In papers [
25,
26,
27,
28,
29,
30], networks reach a critical state, where they exhibit power-law distributions of avalanche sizes and durations, indicative of self-organized criticality (SOC), where the criticality is described by a scale-invariant and power-law distribution of fluctuations.
As for the study of artificial neural networks that can be used to solve real applied problems, significant investigation has been carried out in the works [
31,
32]. There, the eigenvalue spectra of the weight matrices of a trained neural network were empirically examined, and it was found that well-generalizing models exhibit heavy-tailed (power-law) spectral distributions, suggesting that learning guides networks toward a form of self-organized criticality that acts as an implicit regularization. In another paper [
33], the researchers carefully analyzed the learning dynamics of two-layer neural networks in the mean-field limit, showing that as the network width increases, the loss landscape becomes increasingly convex and the Hessian spectrum becomes concentrated, implying that wide networks self-organize into a near-critical regime characterized by flat minima and increased sensitivity to disturbances. At the same time, the authors of another paper [
34] analytically demonstrated that even in randomly initialized neural networks, nonlinear activation functions can cause significant deviations from the classical spectra of random matrices, suggesting that architectural nonlinearity alone can initiate the emergence of structured, critical behavior in the neural network.
Our own study of criticality concerns the study of fluctuations in trainable variables and the possibility of their adoption by a power-law distribution in learning equilibrium. The first step in this direction was made in Ref. [
35], where the criticality, or a power-law distribution of fluctuations, was derived analytically using a macroscopic, or thermodynamic, description developed in [
36]. These results were then confirmed numerically [
35] for a classification learning task involving handwritten digits [
37]. In this paper, we take another step towards developing a theory of emergent criticality by providing a more microscopic description of the phenomena using the dataset-learning duality. We then test our predictions numerically for a classification task.
The paper is organized as follows. In
Section 2, a theoretical model of neural networks and basic notations are introduced.
Section 3 is devoted to developing a statistical description of a local dataset-learning duality. In
Section 4, the distribution of the variables of the tangent space dual to the boundary (dataset) space is analyzed in the multidimensional case. In
Section 5, a toy model with two trainable and two non-trainable variables is introduced, and in
Section 6, it is solved for power-law fluctuations of the trainable variables. In
Section 7, we present numerical results for some specific power-law distributions obtained for specific compositions of activation and loss functions for a toy and a realistic large model. In
Section 8, we summarize and discuss the main results of the paper.
2. Neural Networks
Consider an artificial neural network defined as a neural septuple
[
36], where:
, is a vector state of N neurons,
, is a projection to subspace of boundary neurons,
, is a probability distribution which describes the training dataset,
, is a weight matrix which describes connections between neurons,
, is a bias vector which describes bias in inputs of individual neurons,
, is an activation map which describes a nonlinear part of the dynamics where ,
, is a loss function of both trainable and non-trainable variables.
It is assumed that the bias vector
and weight matrix
are the only trainable parameters which can be combined into a single trainable vector
via transformation tensors
, and
, (Einstein summation convention over repeated indices is implied here and throughout the manuscript unless stated otherwise.)
In standard neural networks, including feedforward [
38], convolutional [
39], auto-encoder [
40], transformers [
41], etc., certain trainable variables can be shared, fixed, or set to zero, but all such architectures can be described using appropriate choices of constant tensors
and
. Then Equation (
1) can be viewed as a linear map from a
K-dimensional space of trainable variables to an
-dimensional space of weights and biases, where
K can be much smaller than
.
The neural septuple defines the three relevant types of dynamics and their relevant time-scales:
Activation dynamics describes the updating of neurons on the smallest time-scale, which can be set to one, due to their connection with each other and the forward propagation of a signal from the boundary. Bulk neurons are usually subject to such dynamics,
where
is the identity matrix. However, in the general case of a fully unconstrained neural network, all neurons can undergo activation dynamics.
Boundary dynamics describes updates of boundary neurons on the intermediate time-scales, e.g., once per L unit time steps, e.g., drawn from probability distribution which describes the dataset.
Learning dynamics describes changes in trainable variables
on the largest time-scale
where
M is the so-called mini-batch size. For example, for the stochastic gradient descent method,
where
is the averaging over mini-batch and
is the learning rate. Please note that
is a unit on the scale of
changes; in further analytical reasoning, for simplicity, we set
. We emphasize that the method (
3) implies that the space of trainable parameters is flat globally and coordinates
are orthonormal, i.e., in the case under consideration
, where
is the Kronecker delta.
For example, in the case of the feedforward neural network the weight matrix must be nilpotent and if its degree is L, i.e., , then the deep neural network has L layers and the state of boundary neurons can be updated once every L time steps.
In what follows, we will be interested in the duality mapping from the boundary space to the tangent space of trainable variables. As we shall see, the map is a composition of the activation and learning passes.
3. Dataset-Learning Duality
The main objective of this section is to establish a duality mapping between the tangent space of trainable variables and the boundary subspace of non-trainable variables. For starters, we consider a large neural network, defined in
Section 2, in a local learning equilibrium, i.e., when the mean value of the trainable parameters remains nearly constant and the evolution is dominated by stochastic fluctuations [
36]. In other words, the subject of the study will be a high-dimensional, yet local, problem that allows for significant simplifications. In particular, this allows us to reduce the effective dimensionality
K of the space of trainable variables to the dimensionality of the space of non-trainable variables
N, or even to the dimensionality of the boundary subspace of non-trainable variables
. However, we will not consider possible symmetries in the dataset that could potentially reduce the dimensionality further.
The learning dynamics, described by Equation (
3), can be viewed as a map from non-trainable degrees of freedom
to changes in trainable degrees of freedom
, i.e.,
where, after the vertical bar, there is a set of parameters of the mapping, which are fixed. Therefore, in Equation (
4), and further vector
is the mean equilibrium value of the trainable variables vector. Similar notations containing a vertical bar cutting off fixed parameters of mapping are used further. Please note that vector
makes sense of bulk neuron values after the final L-th activation step, which will be used to form the prediction of the neural network. The loss function will explicitly depend only on these values of bulk neurons. In Equation (
4) and further notation
denotes the vector of bulk neurons after
i steps of activation dynamics.
For example, if the loss function
H of trainable and non-trainable variables is separable, i.e.,
then, the evolution of weights and biases is given by
where there is no summation over
j,
is the total argument of the activation function
. Please note that for simplicity of notation,
denotes both the argument of the loss function, formed by bulk neurons after L activation steps, and the function itself if the argument is specified in parentheses after it. Therefore, expressions (
6) describe the first step in back-propagation algorithms, ellipses denote all other back-propagation steps.
The activation dynamics, in turn, makes it possible to establish a connection between the initial state of the vector of non-trainable variables
and its state at later time
, i.e.,
Please note that during the activation pass, the input neurons
remain fixed and act as a source for the bulk neurons
, which change with time
. As a result, the vector of bulk neurons in any activation step
can be expressed through the vector
before the activation starts by recursively applying the activation function. The arguments in parentheses after
specify the functional form of the mapping; its specific form depends on the variables on which it depends explicitly, and they are specified as arguments. This notation is valid in Equation (
7) and in similar cases throughout the paper.
Composition of the activation (
7) and learning (
4) maps is a map from non-trainable degrees of freedom
at time
to changes in trainable degrees of freedom
at time
, i.e.,
For example, if the learning dynamics is described by stochastic gradient descent, then the map is given by
which is a map from
N-dimensional space of non-trainable variables to
K-dimensional space of fluctuations of trainable variables. Therefore, the probability distribution
can be expressed as
where the different subscripts are used to emphasize that these are different probability distribution functions, e.g.,
and
(which is also apparent from the arguments of these functions). The vertical bar is used for conditional distribution, e.g.,
and for emphasizing the fixed parameterization of functions, e.g.,
. Also,
denotes the K-dimensional Dirac delta function. Please note that we also abuse notation and denote variables,
, and function,
, using the same symbols.
If the bulk neurons are initialized to zeros (or other constant values) at
time moment before starting of activation for every dataset element, then
, and by integrating (
10) over
we obtain
If
, then we should be able to perform a local coordinate transformation
so that
direction become constraints, i.e.,
for
. Please note that there is more than one way to do it.
The tangent vector
can be projected onto
-dimensional dynamical subspace with a projection matrix
of size
,
where all components are dynamical. We believe that the region, where this linear transformation of the trainable variables provides fluctuations along new
coordinate axes in Euclidean space, is quite wide. Let us emphasize that although we are considering a problem local to
, non-locality in the boundary space of neurons
in the general case leads us to the need to introduce curvilinear coordinate system
instead of (
12) to satisfy the requirement of fluctuations only along the
coordinate axes according to considering duality.
Writing down expression similar to Equation (
11) in transformed variables and integrating it over
constrained directions we obtain
If the map
is invertible, then it can be considered to be a true duality, and then the probability distributions are related through Jacobian matrix
We shall refer to this map as the dataset-learning duality. For non-invertible maps
we can write
where there is summation over different
that are mapped to the same
. However, even in this more general case, the contribution from a single term in the summation might dominate (e.g., if
dominates for some
), and then Equation (
15) would still be approximately satisfied.
In summary, to achieve true local dataset-learning duality via the linear transformation (Equation (
12)) of the trainable variables, the range over which this transformation can accurately identify trainable directions in the entire space
must be sufficiently large, and the Jacobian of the transition in Equation (
15) must be invertible.
4. Distribution of Fluctuations
To obtain an expression for the Jacobian in Equation (
15), the gradient descent Equation (
9) must be rewritten in the transformed variables (
12) and (
13):
where
. (Please note that if
is an orthogonal matrix, i.e., the transition to the new trainable variables is carried out through a rotation transformation, then the metric in the new variable space would remain unchanged, i.e.,
.) The Jacobian matrix can be expressed as
By substituting it back to (
15), we obtain an expression for the probability distribution of fluctuations of trainable variables
with three factors:
—distribution of non-trainable input neurons
; —Jacobian and Hessian of the loss function
; —dependence of the neural network (result or prediction) on the dataset/boundary variables and trainable variables .
The first factor depends directly on the boundary dynamics, i.e., training dataset, the second factor depends on the learning dynamics, i.e., the loss function, and the third factor depends on the activation dynamics, i.e., activation function.
We recall that the transformation to primed variables (
12), only
of which fluctuate, can be carried out in different ways, i.e., the transformation matrix
is not unique. In fact, we have at our disposal the entire subspace in which the nonzero vector
lies, to introduce an arbitrary affine coordinate system. We can carry out linear transformations in this subspace, and they will not change the form of Equations (
17)–(
19). At the same time, we have no reason to prefer one coordinate system to another in this subspace for the requirement of power-law distributions along its axes. This freedom can be used to choose the transformed (or primed) trainable variables in which the probability distribution function approximately factorizes, i.e.,
Then the
m-th statistical moment for some component of the original (or unprimed) variables over some range of scales is given by
where the relationship between primed and unprimed variables (
12) was used.
We shall also assume that each component of fluctuation in the unprimed variables can be determined by only one single (possibly not the same for all) component in the primed variables, i.e.,
where summation over
j is absent. Then continuing to calculate
m-th statistical moment for
we obtain
where the probability distribution of the corresponding primed variables is given by
Thus, a power-law distribution of fluctuations for the original trainable variable
has the same power-law,
where the change in normalization is associated with a change in the range over which the statistical moment in unprimed variables is calculated. Thus, assuming scale-invariance of the distribution in the transformed variables
, we concluded that, under the condition of distribution
factorization, its power-law form remains unchanged even for the directly trainable weights and biases, as was observed in the previous research [
35].
To summarize this section, we have directly demonstrated that the distributions of fluctuations of the trainable variables depend on all three types of dynamics: boundary, activation, and learning. In the following sections, we will analyze the contribution to the distribution
that comes directly from the Jacobian in Equation (
19), based on the toy model. We will also discuss how the power-law distribution
can be achieved by choosing certain compositions of the activation and loss functions. Furthermore, we will experimentally demonstrate that it is precisely the contribution from the Jacobian that is responsible for the emergence of criticality in fluctuations, both in the toy model and in large models, even when the input data distributions are Gaussian.
5. Toy Model
In the previous section, we considered a general multidimensional problem and applied the so-called dataset-learning duality to identify trainable variables in which the scale-invariance is expected. In this section, we will consider a simple example of a two-dimensional problem with both continuous and discrete degrees of freedom.
Consider a neural network consisting of only two neurons: input
and output
connected with a single trainable weight
and a bias
. In addition, we assume that the output neuron can take only two possible values, i.e., its marginal distribution is a sum of two delta functions,
In this case, we can reduce the two-dimensional problem (i.e., two trainable variables
and
and two non-trainable variables
and
) to two one-dimensional ones, corresponding to two different values,
and
. Then, for each one-dimensional problem, we can define a single trainable variable
that is a linear function of
and
, and along which fluctuations will occur. At the same time, there will be no fluctuations along the orthogonal direction
according to dataset-learning duality. Then the transformation matrices in Equations (
12) and (
13) are given by
where the rotation angle
corresponds to the state of the output neuron
.
Let us assume that the loss function
H depends on the output of the neural network
f after only a single step of the activation dynamics, but does not depend explicitly on
, i.e.,
and then Equation (
17) for the one-dimensional case can be written as
If we rewrite the loss function as a function of the argument
y of the activation function
, i.e.,
then
and the expression (
18) is given by
Finally, we arrive at the expression for the probability distribution of fluctuations of the trainable variable in our toy model,
which is similar to the general expression (
19).
6. Emergent Criticality
In this section, we shall utilize the dataset-learning duality (see
Section 3) to investigate the potential emergence of criticality arising mainly from the Jacobian matrix (see
Section 4) within the context of the toy model (see
Section 5). The idea is to determine conditions under which the criticality might emerge in the toy model and then verify the results numerically for a toy-model classification problem. Specifically, our aim is to identify compositions of activation and loss functions that give us a power-law dependence of the Jacobian leading to a power-law distribution of fluctuations in the trainable variables.
From conservation of probability in terms of
(here and below we omit the subscript, so there is only one direction of fluctuations) and
y, we obtain
where the function
is assumed to be invertible. On one hand, the power-law dependence of the Jacobian, i.e.,
implies two possible differential equations
for some new
,
, and where the form of expressions for fluctuations depends on their sign. On the other hand, the gradient descent Equation (
32) expressed through
y implies another differential equation
where
and for the gradient descent
In this section, we shall study different compositions of loss and activation functions, i.e.,
, for which Equations (
37) and (
38) are satisfied and thus the emergence of criticality is expected.
For
Equations (
37) and (
38) can be combined together as
By changing variable
in (
41) and integrating in some region with respect to
z we obtain
or
One can show that for
the integral
can be approximated by the integrand, i.e.,
To obtain this result, let us consider the following integral:
In the limit of
, we can integrate by parts (iteratively) to obtain
where we took into account that
.
In the opposite limit
, we obtain
where we took into account that
.
By combining (
46) and (
47) we obtain
for
.
An arbitrary constant coming from the lower limit of the integral must also be added to the expression. We omit it here and in the following cases for simplicity.
Please note that in the limit
,
and
we obtain an exponential function
For
Equations (
37) and (
38) can be combined together as
where
. If
, then
whose solution is given by
Please note that for
,
we obtain the following power-law dependence
In the opposite limit, i.e.,
, we obtain
or
For
and
the desired composition function takes the form
For
or
Equations (
37) and (
38) can be combined together as
If the quadratic term in the denominator is small, i.e.,
, then upon integration we obtain
For
,
we obtain the following form of composition function
As can be seen from the formulas given in the boxes, exponential, power-law and logarithmic composition of the activation and loss functions are possible, and all of them, when certain above-mentioned conditions are met, can lead to a power-law contribution from the Jacobian (see Equation (
35)) to the distribution of fluctuations of the trainable variable
under consideration.
7. Numerical Results
7.1. Toy Model
In this section, we shall present numerical results for the toy model described in
Section 5, and compare them with analytical results obtained in
Section 6. In particular, we are interested in studying the emergence of critical behavior, or when the distribution of changes in trainable rotated variable
is described by a power-law, i.e.,
The goal is to provide specific numerical examples of the critical behavior (
35) on some ranges of scales for exponential (
49), power-law (
56), and logarithmic (
59) compositions of activation and loss functions, i.e., for
.
For the numerical experiments, we consider a simple dataset with only two classes (called ‘−’ and ‘+’) with a single output neuron which takes discrete values (
or
). The input subspace also consists of only a single neuron that takes continuous values drawn from two Gaussian probability distributions
(for
) and
(for
) with mean
where notations
and
mean averaging over the distributions of the values of the input neuron
with conditions
and
, respectively. The standard deviations are chosen as follows
This simple dataset allows us to use the architecture of the toy model of
Section 5 with one input neuron
, one output neuron
, one trainable bias
b, and one trainable weight
w (from input neuron to output neuron).
Please note that the main goal of this analysis is not to obtain high prediction accuracy (although it is above
in all experiments), or to develop an architecture for a realistic classification problem (although a similar dataset would have been obtained if we considered a truncated MNIST dataset [
37] with only images of ‘zeros’ and ‘ones’ and input states projected down to a single dimension). Rather, our aim is to demonstrate the emergence of criticality within a simple low-dimensional problem, but we expect the same mechanisms to be responsible for the emergence of criticality in higher-dimensional problems [
35].
Also, note that in the learning equilibrium, the average value of the trainable parameters does not change significantly. Therefore, the distribution of the actual changes in trainable variables, i.e., , and the distribution of potential jumps when the trainable parameters remain fixed, i.e., , are essentially the same. However, the latter is more suitable, i.e., we will consider potential jumps caused separately by ‘−’ and ‘+’ classes, which, in the learning equilibrium, can be described as a two one-dimensional problem.
In the remainder of the section, we shall consider five experiments, two for exponential (
49), two for power-law (
56), and one for logarithmic (
59) composition of activation and loss functions.
7.1.1. Exponential Composition,
Consider the sigmoid activation function
with mean-squared loss
or cross-entropy loss
For composition of sigmoid activation (
63) and mean-squared loss (
64) functions and ‘−’ class, in the limit
Equation (
63) reduces to
and then the composition function is
Likewise, for the same composition and ‘+’ class, in the limit
Equation (
63) reduces to
and then the composition function is
Below are the graphs showing the experimental dependencies used to compute the coefficients
for both classes in this case. Knowing these coefficients, one can be confident that the conditions required to obtain the exponential composition (
49) are satisfied. Moreover, it is clear from the graphs in
Figure 1 that the mapping between
y and
is invertible for chosen points, i.e., duality between these variables is observed. At the same time, from graphs in
Figure 2 it is clear that the linear transformation of variables
provides a good approximation of the curve along which fluctuations actually occur. We also carried out a similar procedure to verify the relationships between the coefficients
in subsequent experiments, selecting for analysis only those regions where true dataset-learning duality is realized.
In
Figure 3, we plot in logarithmic axes the distributions of fluctuations of
(blue dots) and the contribution from the Jacobian (red dots) to these distributions corresponding to Equation (
35). The linear behavior with slope
(or
) is in good agreement with Equation (
49).
For composition of sigmoid activation (
63) and cross-entropy loss (
65) function we obtain, for ‘−’ class and in the limit
,
and for ‘+’ class and in the limit
,
In
Figure 4, we plot the distribution of fluctuations of
, which is once again in agreement with Equation (
49). Please note that in this case, despite a nearly perfect power-law contribution from the Jacobian (red dots), the distribution of fluctuations as a whole (blue dots) is significantly distorted by the contribution from
in Equation (
35).
We conclude that for the toy-model classification problem with a sigmoid activation function and either mean-squared error or cross-entropy loss functions, the fluctuations of trainable variables can follow a power law with , as confirmed both numerically and analytically.
7.1.2. Power-Law Composition,
Consider a composition of a ReLU activation function
and the power-law loss function
In this case, non-vanishing fluctuations can take place only when the argument of ReLU is positive, i.e.,
then, the composition function is also a power law
For
, i.e., the standard mean-squared loss, Equation (
56) implies that
which is in agreement with the numerical results plotted in
Figure 5.
For
, Equation (
56) implies that
in agreement with the numerical results plotted in
Figure 6.
Thus, we have demonstrated that the power-law composition case (
56) is achieved by combining power-law loss and ReLU activation functions. More generally, we can obtain any
, where
n is the power that appears in (
73).
7.1.3. Logarithmic Composition,
Consider the cross-entropy loss function (
65) and a piece-wise linear activation function
Within the range
, the composition of activation and loss function is
which is the logarithm composition of Equation (
59) for ‘−’ class (i.e.,
) with
, and for ‘+’ class (i.e.,
) with
.
In
Figure 7, we plot the distribution of fluctuations of
, for which
, in good agreement with the analytical results in (
59).
7.2. Large Model
Next, we describe a procedure that reveals the power-law distribution of fluctuations in the trainable variables of a large neural network, even in the absence of a power-law distribution in the input data.
First, let us address the input data by excluding directions in the data space along which the values of non-trainable variables vary weakly. To achieve this, we compute the covariance matrix
, perform its spectral decomposition
, and switch to the basis of its eigenvectors, where the covariance matrix becomes diagonal:
.
Now, we define the transformed vector as , where the matrix consists of the first n rows of . In other words, the components of are the projections of the original vector onto the eigenvectors of the covariance matrix corresponding to its largest n eigenvalues. Thus, we work in the subspace of the input data most relevant for learning. In this way, one can effectively estimate the dimensionality of the input data space, which in practical problems is often much smaller than the total number of input neurons.
Next, we calculate the cross-covariance matrix
of the input data in the transformed variables
with fluctuations of the trainable variables
. After that, we perform SVD decomposition
and move to the bases of the left and right singular vectors in the input space and space of trainable variable fluctuations, respectively. In these new variables cross-covariance matrix
has nonzero elements only on the main diagonal.
In such variables, the dataset-learning duality becomes explicit due to the absence of correlations between the input data and the fluctuations of the trainable parameters for , i.e., now making the transformation , where consists of the first n rows of matrix , we conclude that the vector of size n and will consider fluctuations along its components.
Note also that the input space is actually mapped to a manifold that is not flat. However, using the cross-covariance matrix, we can obtain a linear approximation of this generally curved subspace in and gain an understanding of where the input subspace is locally mapped.
Next, we will consider what distribution we have along the direction in input space and what distribution is obtained along the direction in the trainable variables fluctuations space correlated with the chosen direction in . For this, we will select the components and along the directions of left and right singular vectors of the cross-covariance matrix in respective spaces, which correspond to the largest singular value of .
7.2.1. UNN Model
Now let us consider a large fully unconstrained neural network (See
Section 2), consisting of 100 input neurons
(hereafter, we omit the subscript and denote them simply as
) and 10 output neurons
. The network is fully connected, meaning that every neuron is connected to every other neuron. Please note that there are no bulk (hidden) neurons; instead, both the input and output neurons are updated during the activation dynamics, as they are mutually connected and do not remain fixed. After all steps of activation dynamics (in our example, we used five steps), the values of the output neurons
are compared with the correct target values for training. The input data
consists of the 100 pixels with the greatest variance across the entire dataset. Thus, the neural network with this architecture is trained to classify images of handwritten digits. Next, the distributions of fluctuations in the local learning equilibrium are analyzed according to the procedure described above.
It can be seen in
Figure 8 and
Figure 9 that the distribution along
has a form close to a Gaussian distribution, while the distribution of fluctuations along
has an explicit power form in a wide range of scales. In
Figure 8, the hyperbolic tangent was used as the activation function for all neurons, and the cross-entropy form of the loss function was chosen, while in
Figure 9, the ReLU activation function was used everywhere, and the mean-squared loss function was applied.
7.2.2. CNN Model
Now, let us consider another example of a convolutional neural network, which consists of 2 convolutional layers and two fully connected layers. The total number of trainable variables is 10,000. This neural network, as in other examples, is used to classify the MNIST dataset. The values of all available pixels for each image of a digit, , are fed to the boundary neurons as input data.
Distributions of input data and fluctuations of the trainable variables in the local learning equilibrium for the CNN are shown in
Figure 10 and
Figure 11. The results are similar to those obtained for the UNN.
It is worth noting that we have already obtained such distribution powers
and
analytically and experimentally for the toy model. Interestingly, the experimental data suggest that the presence of criticality and the value of the distribution power do not depend on the size of the neural network. When using the same loss and activation functions in both small and large neural networks, we observed similar powers for the fluctuation distributions at learning equilibrium, namely
(
Figure 4,
Figure 8 and
Figure 10) and
(
Figure 5,
Figure 9 and
Figure 11). In all considered models, the power-law distributions of trainable variable jumps do not originate from the input data distributions; rather, their emergence is driven by specific forms of the activation and loss functions.
8. Discussion
In this paper, we accomplished two main tasks.
Firstly, we established a duality mapping between the space of boundary neurons (i.e., the dataset) and the tangent space of trainable variables (i.e., the learning), the so-called dataset-learning duality. Both spaces (the boundary and the tangent) are generally very high-dimensional, making the analysis of the duality very non-trivial. However, by considering the problem in a local learning equilibrium and under the assumption that the probability distribution function of the tangent space variables is factorizable, the multidimensional problem can be greatly simplified.
Secondly, we applied the dataset-learning duality to study the emergence of criticality in the learning systems. We show that the observed scale-invariance of fluctuations of the trainable variables (e.g., weights and biases) is caused by the emergence of criticality in the dual tangent space of trainable variables. In particular, we analyzed different compositions of activation and loss function, which can give rise to a power-law distribution of fluctuations of trainable variables on a wide range of scales for a toy model. We showed that the power-law, exponential, and logarithmic compositions of activation and loss functions can all give rise to criticality in the learning dynamics, even if the dataset is in a non-critical state. Main results of the study of criticality are summarized in the following table.
| | Examples |
1 | | Cross-entropy or mean-squared loss functions and sigmoid activation function |
| | Power-law loss and ReLU activation functions |
2 | | Cross-entropy loss and piece-wise linear activation functions |
In addition, we conducted an experiment with a large model classifying digit images from the MNIST dataset. We found that by using the same activation and loss functions as in the toy model, the large neural network exhibits the same power-law behaviors. Thus, we have empirically demonstrated that the results obtained from the toy model generalize to real neural networks. The emergence of criticality and the specific exponents of the trainable variables’ fluctuation distributions depend not on the network size, but on the choice of certain loss and activation functions.
Besides its theoretical significance and potential relevance in modeling critical phenomena in physical and biological systems [
18,
19], the emergence of criticality is expected to play a central role in machine learning applications. Power-law distributions, in particular, enable trainable variables to explore a broader range of scales without facing exponential suppression. Consequently, criticality is presumed to prevent neural networks from becoming trapped in local minima, a highly desirable property for any learning system. However, the analysis of the learning efficiency and its relation to criticality must involve considerations of non-equilibrium systems, which are beyond the scope of the current paper. Nevertheless, we expect that the established dataset-learning duality can be developed further to shed light on non-equilibrium problems. We leave these and other related questions for future research.