Article

Dataset-Learning Duality and Emergent Criticality

by Ekaterina Kukleva 1,* and Vitaly Vanchurin 1,2

1 Artificial Neural Computing, Weston, FL 33332, USA
2 Duluth Institute for Advanced Study, Duluth, MN 55804, USA
* Author to whom correspondence should be addressed.
Entropy 2025, 27(9), 989; https://doi.org/10.3390/e27090989
Submission received: 23 July 2025 / Revised: 12 September 2025 / Accepted: 17 September 2025 / Published: 22 September 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

In artificial neural networks, the activation dynamics of non-trainable variables are strongly coupled to the learning dynamics of trainable variables. During the activation pass, the boundary neurons (e.g., input neurons) are mapped to the bulk neurons (e.g., hidden neurons), and during the learning pass, both bulk and boundary neurons are mapped to changes in trainable variables (e.g., weights and biases). For example, in feedforward neural networks, forward propagation is the activation pass and backward propagation is the learning pass. We show that a composition of the two maps establishes a duality map between a subspace of non-trainable boundary variables (e.g., dataset) and a tangent subspace of trainable variables (i.e., learning). In general, the dataset-learning duality is a complex nonlinear map between high-dimensional spaces. We use the duality to study the emergence of criticality, or the power-law distribution of fluctuations of the trainable variables, in a toy model and in large models at learning equilibrium. In particular, we show that criticality can emerge in the learning system even from a dataset in a non-critical state, and that the power-law distribution can be modified by changing either the activation function or the loss function.

1. Introduction

Dualities provide a highly useful technique for solving complex problems and have found applications in many branches of science, most notably in physics. For example, well-known dualities include electric–magnetic duality [1], wave–particle duality [2], target–space dualities [3], Kramers–Wannier duality [4,5], etc. More recent, but less well-known, examples include a quantum–classical duality [6], the dual path integral [7], and the pseudo-forest duality [8], to name a few. The key idea of all physical dualities is to establish a mapping (i.e., a duality mapping) between two systems (e.g., physical theories), which can then be used to study properties (e.g., obtaining solutions) of one system by analyzing the other system, or vice versa.
Perhaps the most well-studied example of a physical duality is the so-called bulk-boundary or holographic duality [9,10], such as AdS–CFT [11,12,13,14]. In AdS–CFT, the mapping is established between the bulk, representing a gravitational theory on the anti-de Sitter (AdS) background, and the boundary, representing a conformal field theory (CFT) without gravity.
Mathematical dualities focus on more formal, abstract transformations preserving algebraic or geometric structures, but they are also very useful in physics. Examples include the duality between a vector space and its dual [15], the duality of points and lines in projective geometry [16], and the Hom–tensor duality [17].
In this article, we shall consider a learning duality, which we shall refer to as the dataset-learning duality. It is often convenient to view the dataset as representing the states of non-trainable variables on the ‘boundary’, and learning as representing the dynamics (or changes in states) of trainable variables in the ‘bulk’. Thus, the dataset-learning duality may be considered an example of a bulk-boundary duality in the context of learning theory. As we shall argue, the duality is a complex nonlinear mapping between very high-dimensional spaces, but near the equilibrium state the analysis is greatly simplified. This simplification allows us to apply the dataset-learning duality to study the emergence of criticality in learning systems.
Criticality in physical systems refers to the behavior of these systems at or near critical points, where they undergo phase transitions. These critical points are characterized by dramatic changes in physical properties due to the collective behavior of the system’s components. Understanding phase transitions and criticality is essential for explaining many physical phenomena [18] as well as complex biological phenomena such as biological phase transitions [19]. Recent studies have shown that many physical [20,21,22] and biological [23,24] systems can be modeled as learning systems, such as artificial neural networks. Therefore, understanding the criticality and phase transitions in artificial neural networks may also shed light on the emergence of criticality in physical and biological systems.
The phenomenon of self-organized criticality was investigated using biologically inspired discrete network models, which adapt their topology based on local rules, without any central control, to achieve a balance between stability and flexibility. In Refs. [25,26,27,28,29,30], networks reach a critical state in which they exhibit power-law distributions of avalanche sizes and durations, indicative of self-organized criticality (SOC), in which criticality is described by a scale-invariant, power-law distribution of fluctuations.
As for the study of artificial neural networks that can be used to solve real applied problems, significant investigation has been carried out in [31,32]. There, the eigenvalue spectra of the weight matrices of trained neural networks were empirically examined, and it was found that well-generalizing models exhibit heavy-tailed (power-law) spectral distributions, suggesting that learning guides networks toward a form of self-organized criticality that acts as an implicit regularization. In another paper [33], the researchers carefully analyzed the learning dynamics of two-layer neural networks in the mean-field limit, showing that as the network width increases, the loss landscape becomes increasingly convex and the Hessian spectrum becomes concentrated, implying that wide networks self-organize into a near-critical regime characterized by flat minima and increased sensitivity to disturbances. At the same time, the authors of [34] analytically demonstrated that even in randomly initialized neural networks, nonlinear activation functions can cause significant deviations from the classical spectra of random matrices, suggesting that architectural nonlinearity alone can initiate the emergence of structured, critical behavior in the neural network.
Our own study of criticality concerns fluctuations of the trainable variables and the possibility that they follow a power-law distribution in learning equilibrium. The first step in this direction was made in Ref. [35], where criticality, or a power-law distribution of fluctuations, was derived analytically using a macroscopic, or thermodynamic, description developed in [36]. These results were then confirmed numerically [35] for a classification learning task involving handwritten digits [37]. In this paper, we take another step towards developing a theory of emergent criticality by providing a more microscopic description of the phenomenon using the dataset-learning duality. We then test our predictions numerically for a classification task.
The paper is organized as follows. In Section 2, a theoretical model of neural networks and basic notations are introduced. Section 3 is devoted to developing a statistical description of a local dataset-learning duality. In Section 4, the distribution of the variables of the tangent space dual to the boundary (dataset) space is analyzed in the multidimensional case. In Section 5, a toy model with two trainable and two non-trainable variables is introduced, and in Section 6, it is solved for power-law fluctuations of the trainable variables. In Section 7, we present numerical results for some specific power-law distributions obtained for specific compositions of activation and loss functions for a toy and a realistic large model. In Section 8, we summarize and discuss the main results of the paper.

2. Neural Networks

Consider an artificial neural network defined as a neural septuple $(x, \hat{P}, p_\partial, \hat{w}, b, f, H)$ [36], where:
  • $x \in \mathbb{R}^N$ is a vector state of $N$ neurons,
  • $\hat{P}$ is a projection onto the subspace of $N_\partial = \operatorname{Tr}(\hat{P})$ boundary neurons,
  • $p_\partial(x_\partial)$ is a probability distribution which describes the training dataset,
  • $\hat{w} \in \mathbb{R}^{N^2}$ is a weight matrix which describes connections between neurons,
  • $b \in \mathbb{R}^N$ is a bias vector which describes the bias in the inputs of individual neurons,
  • $f(y)$ is an activation map which describes the nonlinear part of the dynamics, where $y = \hat{w} x + b$,
  • $H(x, \hat{w}, b)$ is a loss function of both trainable and non-trainable variables.
It is assumed that the bias vector $b$ and the weight matrix $\hat{w}$ are the only trainable parameters, which can be combined into a single trainable vector $q$ via constant transformation tensors $W_{ijl}$ and $B_{il}$ (Einstein summation over repeated indices is implied here and throughout the manuscript unless stated otherwise):
(1) $w_{ij} = W_{ijl}\, q_l, \quad b_i = B_{il}\, q_l.$
In standard neural networks, including feedforward [38], convolutional [39], auto-encoder [40], transformers [41], etc., certain trainable variables can be shared, fixed, or set to zero, but all such architectures can be described using appropriate choices of the constant tensors $W_{ijl}$ and $B_{il}$. Then Equation (1) can be viewed as a linear map from a $K$-dimensional space of trainable variables to an $(N^2 + N)$-dimensional space of weights and biases, where $K$ can be much smaller than $N^2 + N$.
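To make Equation (1) concrete, the following minimal sketch (in Python; all shapes and random values are illustrative assumptions, not the architectures studied here) builds the weights and biases from a single trainable vector:

```python
import numpy as np

# Minimal sketch of Equation (1): weights w_ij and biases b_i as linear images
# of one trainable vector q via constant tensors W_{ijl} and B_{il}.
N, K = 4, 3                       # N neurons, K trainable parameters
W = np.random.randn(N, N, K)      # constant tensor W_{ijl}
B = np.random.randn(N, K)         # constant tensor B_{il}
q = np.random.randn(K)            # trainable vector q_l

w = np.einsum('ijl,l->ij', W, q)  # w_ij = W_{ijl} q_l
b = np.einsum('il,l->i', B, q)    # b_i  = B_{il} q_l

# Weight sharing or frozen entries correspond to special choices of W and B,
# e.g., a row of zeros in B fixes the corresponding bias to zero.
```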
The neural septuple $(x, \hat{P}, p_\partial, \hat{w}, b, f, H)$ defines the three relevant types of dynamics and their relevant time-scales:
  • Activation dynamics describes the updating of neurons on the smallest time-scale, which can be set to one, due to their connections with each other and the forward propagation of a signal from the boundary. Bulk neurons are usually subject to such dynamics,
    (2) $(\hat{I} - \hat{P})\, x(t+1) = (\hat{I} - \hat{P})\, f(\hat{w} x(t) + b),$
    where $\hat{I}$ is the identity matrix. However, in the general case of a fully unconstrained neural network, all neurons can undergo activation dynamics.
  • Boundary dynamics describes updates of the boundary neurons $x_\partial = \hat{P} x$ on intermediate time-scales, e.g., once per $L$ unit time steps, with states drawn from the probability distribution $p_\partial(x_\partial)$ which describes the dataset.
  • Learning dynamics describes changes in the trainable variables $q$ on the largest time-scale $ML$, where $M$ is the so-called mini-batch size. For example, for the stochastic gradient descent method,
    (3) $\dot{q}_i \equiv q_i(t + ML) - q_i(t) = -\gamma\, \delta_{ij}\, \frac{d \langle H \rangle_M}{d q_j},$
    where $\langle \cdot \rangle_M$ denotes averaging over the mini-batch and $\gamma$ is the learning rate. Please note that $ML$ is the unit of time on the scale of changes in $q$; in further analytical reasoning, for simplicity, we set $M = 1$. We emphasize that the method (3) implies that the space of trainable parameters is globally flat and the coordinates $q_i$ are orthonormal, i.e., in the case under consideration $g_{ij} = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta.
For example, in the case of a feedforward neural network the weight matrix $\hat{w}$ must be nilpotent, and if its degree is $L$, i.e., $\hat{w}^L = 0$, then the deep neural network has $L$ layers and the state of the boundary neurons can be updated once every $L$ time steps.
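A minimal sketch of one step of these dynamics may also be helpful. The projector $\hat{P}$ below selects the boundary neurons; all sizes, values, and the choice of $\tanh$ are illustrative assumptions:

```python
import numpy as np

# Sketch of the activation dynamics (Equation (2)): bulk components are updated
# through f(w x + b) while boundary components, selected by the projector P,
# are held at their dataset values.
N, N_boundary, L = 6, 2, 5
P = np.diag([1.0] * N_boundary + [0.0] * (N - N_boundary))  # boundary projector
I = np.eye(N)
w = 0.1 * np.random.randn(N, N)   # weight matrix (nilpotent for feedforward nets)
b = np.zeros(N)

x = np.zeros(N)
x[:N_boundary] = [0.3, -1.2]      # boundary state drawn from p(x_boundary)
for _ in range(L):                # L unit time steps between boundary updates
    x = P @ x + (I - P) @ np.tanh(w @ x + b)
```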
In what follows, we will be interested in the duality mapping from the boundary space to the tangent space of trainable variables. As we shall see, the map is a composition of the activation and learning passes.

3. Dataset-Learning Duality

The main objective of this section is to establish a duality mapping between the tangent space of trainable variables and the boundary subspace of non-trainable variables. For starters, we consider a large neural network, defined in Section 2, in a local learning equilibrium, i.e., when the mean value of the trainable parameters remains nearly constant and the evolution is dominated by stochastic fluctuations [36]. In other words, the subject of the study will be a high-dimensional, yet local, problem that allows for significant simplifications. In particular, this allows us to reduce the effective dimensionality $K$ of the space of trainable variables to the dimensionality of the space of non-trainable variables $N$, or even to the dimensionality of the boundary subspace of non-trainable variables $N_\partial = \operatorname{Tr}(\hat{P})$. However, we will not consider possible symmetries in the dataset that could potentially reduce the dimensionality further.
The learning dynamics, described by Equation (3), can be viewed as a map from non-trainable degrees of freedom to changes in trainable degrees of freedom $\dot{q}$, i.e.,
(4) $\big(f, x(t_{L-1}), \ldots, x(t_0), x_\partial \mid \bar{q}\big) \rightarrow \dot{q},$
where the set of parameters of the mapping, which are held fixed, appears after the vertical bar. In Equation (4) and in what follows, the vector $\bar{q}$ is the mean equilibrium value of the vector of trainable variables. Similar notations containing a vertical bar separating the fixed parameters of a mapping are used throughout. Please note that the vector $f$ represents the values of the bulk neurons after the final $L$-th activation step, which are used to form the prediction of the neural network; the loss function will explicitly depend only on these values of the bulk neurons. In Equation (4) and further, the notation $x(t_i)$ denotes the vector of bulk neurons after $i$ steps of activation dynamics.
For example, if the loss function $H$ of trainable and non-trainable variables is separable, i.e.,
(5) $H = H_x(x_\partial, f) + H_q(q),$
then the evolution of weights and biases is given by
(6) $\dot{b}_j = -\gamma\, \delta_{ji}\, \frac{\partial H_q}{\partial b_i} - \gamma\, \frac{\partial f_j(y_j(t_{L-1}))}{\partial y_j(t_{L-1})}\, \frac{\partial H_x}{\partial f_j} + \ldots, \qquad \dot{w}_{jk} = -\gamma\, \delta_{ji}\, \frac{\partial H_q}{\partial w_{im}}\, \delta_{mk} - \gamma\, \delta_{kl}\, x_l(t_{L-1})\, \frac{\partial f_j(y_j(t_{L-1}))}{\partial y_j(t_{L-1})}\, \frac{\partial H_x}{\partial f_j} + \ldots,$
where there is no summation over $j$, and $y_j(t_{L-1}) = w_{ji}\, x_i(t_{L-1}) + b_j$ is the total argument of the activation function $f_j$. Please note that, for simplicity of notation, $f$ denotes both the argument of the loss function, formed by the bulk neurons after $L$ activation steps, and the function itself when its argument is specified in parentheses after it. The expressions (6) describe the first step of the back-propagation algorithm; the ellipses denote all other back-propagation steps.
The activation dynamics, in turn, makes it possible to establish a connection between the initial state of the vector of non-trainable variables $(x_\partial, x(t_0))$ and its state at a later time $(x_\partial, x(t_i))$, i.e.,
(7) $x(t_1) = f(y(t_0)), \quad x(t_2) = f(y(t_1)), \quad \ldots, \quad x(t_L) = f(y(t_{L-1})) = f(x(t_0) \mid \bar{q}).$
Please note that during the activation pass, the input neurons $x_\partial$ remain fixed and act as a source for the bulk neurons $x(t_i)$, which change with time $t_i$. As a result, the vector of bulk neurons at any activation step $x(t_i)$ can be expressed through the vector $x(t_0)$ before the activation starts by recursively applying the activation function. The arguments in parentheses after $f$ specify the functional form of the mapping; its specific form depends on the variables on which it depends explicitly, and those are specified as arguments. This notation is valid in Equation (7) and in similar cases throughout the paper.
The composition of the activation (7) and learning (4) maps is a map from the non-trainable degrees of freedom $(x_\partial, x(t_0))$ at time $t_0$ to changes in the trainable degrees of freedom $\dot{q}$ at time $t_L$, i.e.,
(8) $\big(x_\partial, x(t_0) \mid \bar{q}\big) \rightarrow \dot{q}.$
For example, if the learning dynamics is described by stochastic gradient descent, then the map is given by
(9) $\dot{q}_k(x_\partial, x(t_0) \mid \bar{q}) = -\gamma\, \delta_{km} \left( \frac{\partial f_j(x_\partial, x(t_0), \bar{q})}{\partial q_m}\, \frac{\partial}{\partial f_j} + \frac{\partial}{\partial q_m} \right) H(x_\partial, f \mid \bar{q}),$
which is a map from the $N$-dimensional space of non-trainable variables to the $K$-dimensional space of fluctuations of trainable variables. Therefore, the probability distribution $p_{\dot{q}}(\dot{q})$ can be expressed as
(10) $p_{\dot{q}}(\dot{q} \mid \bar{q}) = \int p_{x_\partial x}\big(x_\partial, x(t_0)\big)\, \delta^{(K)}\!\big(\dot{q} - \dot{q}(x_\partial, x(t_0) \mid \bar{q})\big)\, d^{N_\partial} x_\partial\, d^{N} x(t_0),$
where the different subscripts are used to emphasize that these are different probability distribution functions, e.g., $p_{\dot{q}}(\cdot)$ and $p_{x_\partial x}(\cdot)$ (which is also apparent from the arguments of these functions). The vertical bar is used for conditional distributions, e.g., $p_{\dot{q}}(\dot{q} \mid \bar{q})$, and for emphasizing the fixed parameterization of functions, e.g., $\dot{q}(x_\partial, x(t_0) \mid \bar{q})$. Also, $\delta^{(K)}$ denotes the $K$-dimensional Dirac delta function. Please note that we also abuse notation and denote the variables, $\dot{q}$, and the function, $\dot{q}(x_\partial, x(t_0) \mid \bar{q})$, by the same symbols.
If the bulk neurons are initialized to zeros (or other constant values) at time $t_0$, before the activation starts for every dataset element, then $p_{x_\partial x}(x_\partial, x(t_0)) = p_{x_\partial}(x_\partial)\, \delta^{(N)}(x(t_0))$, and by integrating (10) over $x(t_0)$ we obtain
(11) $p_{\dot{q}}(\dot{q} \mid \bar{q}) = \int p_{x_\partial}(x_\partial)\, \delta^{(K)}\!\big(\dot{q} - \dot{q}(x_\partial \mid x(t_0), \bar{q})\big)\, d^{N_\partial} x_\partial.$
If $K > N_\partial$, then we should be able to perform a local coordinate transformation
(12) $q'_i = \Lambda_{ij}\, q_j,$
so that $K - N_\partial$ directions become constraints, i.e., $\dot{q}'_i = 0$ for $i = N_\partial + 1, \ldots, K$. Please note that there is more than one way to do this.
The tangent vector $\dot{q}$ can be projected onto the $N_\partial$-dimensional dynamical subspace with a projection matrix $\hat{R}$ of size $N_\partial \times K$,
(13) $\dot{q}'_i = R_{ir}\, \Lambda_{rj}\, \dot{q}_j,$
where all components are dynamical. We believe that the region where this linear transformation of the trainable variables confines fluctuations to the new $N_\partial$ coordinate axes in Euclidean space is quite wide. Let us emphasize that although we are considering a problem local in $\bar{q}$, non-locality in the boundary space of neurons $x_\partial$ would, in the general case, require introducing a curvilinear coordinate system instead of the linear transformation (12) in order to satisfy the requirement that fluctuations occur only along the $N_\partial$ coordinate axes, in accordance with the duality under consideration.
Writing down an expression similar to Equation (11) in the transformed variables and integrating it over the $K - N_\partial$ constrained directions, we obtain
(14) $p_{\dot{q}'}(\dot{q}' \mid \bar{q}) = \int p_{x_\partial}(x_\partial)\, \delta^{(N_\partial)}\!\big(\dot{q}' - \dot{q}'(x_\partial \mid x(t_0), \bar{q})\big)\, d^{N_\partial} x_\partial.$
If the map $\dot{q}'(x_\partial \mid x(t_0), \bar{q})$ is invertible, then it can be considered a true duality, and the probability distributions are related through the Jacobian matrix,
(15) $p_{\dot{q}'}(\dot{q}') = p_{x_\partial}\big(x_\partial(\dot{q}')\big)\, \left| \det \frac{\partial \dot{q}'_i}{\partial x_{\partial j}} \right|^{-1}.$
We shall refer to this map as the dataset-learning duality. For non-invertible maps $\dot{q}'(x_\partial \mid x(t_0), \bar{q})$ we can write
(16) $p_{\dot{q}'}(\dot{q}')\, d\dot{q}' = \sum p_{x_\partial}(x_\partial)\, dx_\partial,$
where the summation is over the different $x_\partial$ that are mapped to the same $\dot{q}'$. However, even in this more general case, the contribution from a single term in the summation might dominate (e.g., if $p_{x_\partial}(x_\partial)$ dominates for some $x_\partial$), and then Equation (15) would still be approximately satisfied.
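For intuition, the change-of-variables relation (15) is easy to verify numerically in one dimension. In the sketch below, the Gaussian input distribution and the invertible map $g$ are arbitrary illustrative choices:

```python
import numpy as np

# One-dimensional illustration of Equation (15): push samples of p_x through an
# invertible map qdot = g(x) and compare the histogram with
# p_x(x(qdot)) |dg/dx|^{-1}.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1_000_000)
qdot = np.exp(x)                        # invertible map g(x) = e^x

grid = np.geomspace(0.1, 5.0, 200)
x_of_q = np.log(grid)                   # inverse map x(qdot)
p_x = np.exp(-x_of_q**2 / 2) / np.sqrt(2 * np.pi)
p_qdot = p_x / grid                     # |dqdot/dx|^{-1} = 1/g(x) = 1/qdot

hist, _ = np.histogram(qdot, bins=grid, density=True)
centers = np.sqrt(grid[1:] * grid[:-1])
# the histogram tracks the analytic push-forward (a lognormal density)
print(np.max(np.abs(hist - np.interp(centers, grid, p_qdot))) < 0.05)
```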
In summary, to achieve a true local dataset-learning duality via the linear transformation (12) of the trainable variables, the region over which this transformation can accurately identify the trainable directions in the full space $q$ must be sufficiently large, and the Jacobian of the transition in Equation (15) must be invertible.

4. Distribution of Fluctuations

To obtain an expression for the Jacobian in Equation (15), the gradient descent Equation (9) must be rewritten in the transformed variables (12) and (13):
(17) $\dot{q}'_i(x_\partial \mid \bar{q}) = -\gamma\, R_{ir}\, \Lambda_{rk}\, \delta_{km}\, \Lambda_{lm}\, \frac{d H(x_\partial, f \mid \bar{q})}{d q'_l} = -\gamma\, R_{ir}\, g_{rl} \left( \frac{\partial f_j(x_\partial \mid \bar{q})}{\partial q'_l}\, \frac{\partial H_x}{\partial f_j} + \frac{\partial H_q}{\partial q'_l} \right),$
where $g_{rl} = \Lambda_{rk} \Lambda_{lk} = (\Lambda \Lambda^T)_{rl}$. (Please note that if $\Lambda$ is an orthogonal matrix, i.e., the transition to the new trainable variables is carried out through a rotation, then the metric in the new variable space remains unchanged, i.e., $g_{rl} = \delta_{rl}$.) The Jacobian matrix can be expressed as
(18) $\frac{\partial \dot{q}'_i}{\partial x_{\partial k}} = -\gamma\, R_{ir}\, g_{rl} \left( \frac{\partial^2 f_j(x_\partial \mid \bar{q})}{\partial x_{\partial k}\, \partial q'_l}\, \frac{\partial H_x}{\partial f_j} + \frac{\partial f_j(x_\partial \mid \bar{q})}{\partial q'_l}\, \frac{\partial^2 H_x(x_\partial, f)}{\partial x_{\partial k}\, \partial f_j} \right).$
By substituting it back into (15), we obtain an expression for the probability distribution of fluctuations of the trainable variables,
(19) $p_{\dot{q}'}(\dot{q}') = \gamma^{-N_\partial}\, p_{x_\partial}\big(x_\partial(\dot{q}')\big) \left| \det\!\left[ R_{ir}\, g_{rl} \left( \frac{\partial^2 f_j(x_\partial \mid \bar{q})}{\partial x_{\partial k}\, \partial q'_l}\, \frac{\partial H_x}{\partial f_j} + \frac{\partial f_j(x_\partial \mid \bar{q})}{\partial q'_l}\, \frac{\partial^2 H_x(x_\partial, f)}{\partial x_{\partial k}\, \partial f_j} \right) \right] \right|^{-1}$
with three factors:
  • $p_{x_\partial}(x_\partial)$ — the distribution of non-trainable input neurons;
  • $\frac{\partial H_x}{\partial f_j}$; $\frac{\partial^2 H_x}{\partial x_{\partial k}\, \partial f_j}$ — the Jacobian and Hessian of the loss function;
  • $\frac{\partial^2 f_j(x_\partial \mid \bar{q})}{\partial x_{\partial k}\, \partial q'_l}$; $\frac{\partial f_j(x_\partial \mid \bar{q})}{\partial q'_l}$ — the dependence of the neural network output (prediction) $f$ on the dataset/boundary variables $x_\partial$ and the trainable variables $q'$.
The first factor depends directly on the boundary dynamics, i.e., the training dataset; the second factor depends on the learning dynamics, i.e., the loss function; and the third factor depends on the activation dynamics, i.e., the activation function.
We recall that the transformation (12) to primed variables, only $N_\partial$ of which fluctuate, can be carried out in different ways, i.e., the transformation matrix $\Lambda$ is not unique. In fact, we have at our disposal the entire subspace in which the nonzero vector $\dot{q}'$ lies, in which to introduce an arbitrary affine coordinate system. We can carry out linear transformations in this subspace, and they will not change the form of Equations (17)–(19). At the same time, we have no reason to prefer one coordinate system over another in this subspace with respect to the requirement of power-law distributions along its axes. This freedom can be used to choose the transformed (or primed) trainable variables in which the probability distribution function approximately factorizes, i.e.,
(20) $p_{\dot{q}'}(\dot{q}') \approx \prod_{i=1}^{N_\partial} p_{\dot{q}'_i}(\dot{q}'_i).$
Then the $m$-th statistical moment of some component of the original (or unprimed) variables over some range of scales is given by
(21) $\langle \dot{q}_i^m \rangle = \int \dot{q}_i^m(\dot{q}')\, p_{\dot{q}}\big(\dot{q}(\dot{q}')\big) \left| \frac{\partial \dot{q}}{\partial \dot{q}'} \right| d\dot{q}' = \int \big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big)^m\, p_{\dot{q}'}(\dot{q}')\, d\dot{q}' \approx \int \big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big)^m \prod_{i=1}^{N_\partial} p_{\dot{q}'_i}(\dot{q}'_i)\, d\dot{q}'_i,$
where the relationship (12) between the primed and unprimed variables was used.
We shall also assume that each component of the fluctuation in the unprimed variables is determined by only one single (possibly not the same for all) component in the primed variables, i.e.,
(22) $\dot{q}_i \approx (\Lambda^{-1})_{ij}\, \dot{q}'_j,$
where summation over $j$ is absent. Then, continuing the calculation of the $m$-th statistical moment of $\dot{q}_i$, we obtain
(23) $\langle \dot{q}_i^m \rangle = \int \big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big)^m A_j\, (\dot{q}'_j)^{-k_j}\, d\dot{q}'_j = \int \big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big)^m\, \frac{A_j \big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big)^{-k_j}}{\big( (\Lambda^{-1})_{ij} \big)^{-k_j + 1}}\, d\big( (\Lambda^{-1})_{ij}\, \dot{q}'_j \big) = \int \dot{q}_i^m\, p_{\dot{q}_i}(\dot{q}_i)\, d\dot{q}_i,$
where the probability distribution of the corresponding primed variable is given by
(24) $p_{\dot{q}'_j}(\dot{q}'_j) = A_j\, (\dot{q}'_j)^{-k_j}, \quad \dot{q}'_j \in [a, b].$
Thus, a power-law distribution of fluctuations for the original trainable variable $q_i$ has the same power,
(25) $p_{\dot{q}_i}(\dot{q}_i) = \frac{A_j\, \dot{q}_i^{-k_j}}{\big( (\Lambda^{-1})_{ij} \big)^{-k_j + 1}}, \quad \dot{q}_i \in \big[ (\Lambda^{-1})_{ij}\, a,\ (\Lambda^{-1})_{ij}\, b \big],$
where the change in normalization is associated with the change in the range over which the statistical moment in the unprimed variables is calculated. Thus, assuming scale-invariance of the distribution in the transformed variables $\dot{q}'$, we conclude that, under the condition that the distribution $p_{\dot{q}'}(\dot{q}')$ factorizes, its power-law form remains unchanged even for the directly trainable weights and biases, as was observed in previous research [35].
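The invariance of the exponent under the rescaling in Equations (24) and (25) can also be checked numerically; in the sketch below, the exponent, range, and scale factor are illustrative assumptions:

```python
import numpy as np

# Numerical check of Equations (24)-(25): rescaling a power-law sample by a
# constant (one element of Lambda^{-1}) leaves the fitted exponent unchanged.
rng = np.random.default_rng(1)
k, a, b = 1.5, 1e-3, 1.0
u = rng.uniform(size=500_000)
# inverse-CDF sampling of p(s) ~ s^{-k} on [a, b]
s = (a**(1 - k) + u * (b**(1 - k) - a**(1 - k)))**(1 / (1 - k))

def log_log_slope(sample):
    bins = np.geomspace(sample.min(), sample.max(), 40)
    hist, _ = np.histogram(sample, bins=bins, density=True)
    centers = np.sqrt(bins[1:] * bins[:-1])
    keep = hist > 0
    return np.polyfit(np.log(centers[keep]), np.log(hist[keep]), 1)[0]

print(log_log_slope(s), log_log_slope(3.7 * s))   # both close to -k = -1.5
```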
To summarize this section, we have directly demonstrated that the distributions of fluctuations of the trainable variables depend on all three types of dynamics: boundary, activation, and learning. In the following sections, we will analyze the contribution to the distribution p q ˙ ( q ˙ ) that comes directly from the Jacobian in Equation (19), based on the toy model. We will also discuss how the power-law distribution p q ˙ ( q ˙ ) can be achieved by choosing certain compositions of the activation and loss functions. Furthermore, we will experimentally demonstrate that it is precisely the contribution from the Jacobian that is responsible for the emergence of criticality in fluctuations, both in the toy model and in large models, even when the input data distributions are Gaussian.

5. Toy Model

In the previous section, we considered a general multidimensional problem and applied the so-called dataset-learning duality to identify trainable variables in which the scale-invariance is expected. In this section, we will consider a simple example of a two-dimensional problem with both continuous and discrete degrees of freedom.
Consider a neural network consisting of only two neurons: an input $x_1$ and an output $x_2$ connected by a single trainable weight $w = w_{21}$ and a bias $b = b_2$. In addition, we assume that the output neuron can take only two possible values, i.e., its marginal distribution is a sum of two delta functions,
(26) $p_{x_2}(x_2) = \tfrac{1}{2}\, \delta(x_2 - X_+) + \tfrac{1}{2}\, \delta(x_2 - X_-).$
In this case, we can reduce the two-dimensional problem (i.e., two trainable variables $q_1 = w$ and $q_2 = b$ and two non-trainable variables $x_1$ and $x_2$) to two one-dimensional ones, corresponding to the two different values, $x_2 = X_+$ and $x_2 = X_-$. Then, for each one-dimensional problem, we can define a single trainable variable $q'_1$ that is a linear function of $q_1$ and $q_2$, and along which fluctuations will occur. At the same time, there will be no fluctuations along the orthogonal direction $q'_2$, in accordance with the dataset-learning duality. The transformation matrices in Equations (12) and (13) are then given by
(27) $\Lambda_\pm = \begin{pmatrix} \cos\theta_\pm & \sin\theta_\pm \\ -\sin\theta_\pm & \cos\theta_\pm \end{pmatrix},$
(28) $R = \begin{pmatrix} 1 & 0 \end{pmatrix},$
where the rotation angle $\theta_\pm$ corresponds to the state of the output neuron $x_2 = X_\pm$.
Let us assume that the loss function $H$ depends on the output of the neural network $f$ after only a single step of the activation dynamics, but does not depend explicitly on $q$, i.e.,
(29) $H = H_x(f \mid x_2),$
and then Equation (17) for the one-dimensional case can be written as
(30) $\dot{q}'_1 = -\gamma\, \frac{\partial f(x_1 \mid \bar{q})}{\partial q'_1}\, \frac{\partial H}{\partial f}.$
If we rewrite the loss function as a function of the argument $y$ of the activation function $f(y)$, i.e.,
(31) $y(x_1, q') = (x_1 \cos\theta_\pm + \sin\theta_\pm)\, q'_1 + (-x_1 \sin\theta_\pm + \cos\theta_\pm)\, q'_2,$
then
(32) $\dot{q}'_1 = -\gamma\, \frac{\partial y}{\partial q'_1}\, \frac{\partial H(y \mid x_2)}{\partial y} = -\gamma\, (x_1 \cos\theta_\pm + \sin\theta_\pm)\, \frac{\partial H(y \mid x_2)}{\partial y},$
and the expression (18) is given by
(33) $\frac{\partial \dot{q}'_1}{\partial x_1} = -\gamma\, \cos\theta_\pm\, \frac{\partial H(y \mid x_2)}{\partial y}.$
Finally, we arrive at the expression for the probability distribution of fluctuations of the trainable variable in our toy model,
(34) $p_{\dot{q}'_1}(\dot{q}'_1) = p_{x_1}\big(x_1(\dot{q}'_1)\big) \left| \gamma\, \cos\theta_\pm\, \frac{\partial H(y \mid x_2)}{\partial y} \right|^{-1},$
which is similar to the general expression (19).
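The map (31)–(32) is simple enough to sample directly. The sketch below evaluates the fluctuations $\dot{q}'_1$ for the ‘−’ class with a sigmoid activation and mean-squared loss; the values of $\gamma$, $\theta$, and the primed variables are illustrative assumptions:

```python
import numpy as np

# Sketch of the toy-model map (31)-(32): sample the input x1, compute
# y(x1, q') and the fluctuation qdot'_1, then histogram it as in Equation (34).
rng = np.random.default_rng(3)
gamma, theta = 0.1, 0.3
q1p, q2p = 1.5, -0.7                     # fixed (equilibrium) primed variables
x1 = rng.normal(0.0, 0.25, 1_000_000)    # '-' class inputs
x2 = 0.0

y = (x1 * np.cos(theta) + np.sin(theta)) * q1p \
    + (-x1 * np.sin(theta) + np.cos(theta)) * q2p   # Equation (31)
f = 1 / (1 + np.exp(-y))                 # sigmoid activation
dH_dy = 2 * (f - x2) * f * (1 - f)       # dH/dy for H = (f - x2)^2
qdot1 = -gamma * (x1 * np.cos(theta) + np.sin(theta)) * dH_dy   # Equation (32)

hist, edges = np.histogram(np.abs(qdot1), bins=np.geomspace(1e-8, 1e-1, 50),
                           density=True)
# on log-log axes the histogram exposes any power-law range of p(|qdot'_1|)
```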

6. Emergent Criticality

In this section, we shall utilize the dataset-learning duality (see Section 3) to investigate the potential emergence of criticality arising mainly from the Jacobian matrix (see Section 4) within the context of the toy model (see Section 5). The idea is to determine conditions under which the criticality might emerge in the toy model and then verify the results numerically for a toy-model classification problem. Specifically, our aim is to identify compositions of activation and loss functions that give us a power-law dependence of the Jacobian leading to a power-law distribution of fluctuations in the trainable variables.
From conservation of probability in terms of $\dot{q}'$ (here and below we omit the subscript, since there is only one direction of fluctuations) and $y$, we obtain
(35) $p_{\dot{q}'}(\dot{q}') = p_y\big(y(\dot{q}')\big) \left| \frac{\partial y}{\partial \dot{q}'} \right|,$
where the function $y(\dot{q}')$ is assumed to be invertible. On one hand, a power-law dependence of the Jacobian, i.e.,
(36) $\frac{dy}{d\dot{q}'} = \frac{1}{A}\, |\dot{q}'|^{-k}, \quad A = A(w, b) > 0, \quad k \geq 0,$
implies two possible solutions of the corresponding differential equation,
(37) $\dot{q}' = \begin{cases} \operatorname{sign}(\dot{q}')\, \exp(Ay + B) & \text{for } k = 1, \\ \operatorname{sign}(\dot{q}')\, (Ay + B)^{\frac{1}{1-k}} & \text{for } k \neq 1, \end{cases}$
for some new $A = A(w, b)$, $B = B(w, b)$, where the form of the expressions for the fluctuations depends on their sign. On the other hand, the gradient descent Equation (32) expressed through $y$ implies another differential equation,
(38) $\dot{q}' = (Dy - C)\, \frac{\partial H}{\partial y},$
where
(39) $C = -\frac{\gamma \cos\theta_\pm}{w}\, \big( b - w \tan\theta_\pm \big), \quad D = -\frac{\gamma \cos\theta_\pm}{w},$
and for the gradient descent
(40) $Dy - C > 0.$
In this section, we shall study different compositions of loss and activation functions, i.e., $H(f(y))$, for which Equations (37) and (38) are satisfied and thus the emergence of criticality is expected.
For $k = 1$, Equations (37) and (38) can be combined as
(41) $\frac{d H(f(y))}{dy} = \frac{\operatorname{sign}(\dot{q}')\, \exp(Ay + B)}{Dy - C}.$
By changing the variable to $z = Ay - AC/D$ in (41) and integrating in some region with respect to $z$, we obtain
(42) $H(f(z)) = \operatorname{sign}(\dot{q}')\, \frac{1}{D}\, \exp\!\left( \frac{AC}{D} + B \right) \int_{z_0}^{z} \frac{\exp(z')}{z'}\, dz',$
or
(43) $H(f(y)) = \operatorname{sign}(\dot{q}')\, \frac{1}{D}\, \exp\!\left( \frac{AC}{D} + B \right) \int_{z_0}^{Ay - AC/D} \frac{\exp(z')}{z'}\, dz'.$
One can show that for $|z| \gg 1$ the integral $\int dz\, \frac{\exp(z)}{z}$ can be approximated by the integrand, i.e.,
(44) $H(f(y)) \approx \operatorname{sign}(\dot{q}')\, \frac{\exp(Ay + B)}{A\,(Dy - C)}.$
To obtain this result, let us consider the following integral:
(45) $I(x) = \int^{x} \frac{\exp(z)}{z}\, dz.$
In the limit of $x \gg 1$, we can integrate by parts (iteratively) to obtain
(46) $I(x) = \frac{\exp(x)}{x} + \frac{\exp(x)}{x} \sum_{n=1} \frac{n!}{x^n} \lesssim \frac{\exp(x)}{x} + \frac{\exp(x+1)}{x^2} \approx \frac{\exp(x)}{x},$
where we took into account that $\sum_{n=1} \frac{n!}{x^n} \lesssim \frac{e}{x}\, \frac{1}{1 - 1/x}$.
In the opposite limit, $y \equiv -x \gg 1$, we obtain
(47) $I(-y) = -\frac{\exp(-y)}{y} + \frac{\exp(-y)}{y} \sum_{n=1} \frac{(-1)^{n+1} n!}{y^n} \lesssim -\frac{\exp(-y)}{y} + \frac{\exp(-(y-1))}{y^2} \approx -\frac{\exp(-y)}{y},$
where we took into account that $\sum_{n=1} \frac{(-1)^{n+1} n!}{y^n} \lesssim \frac{e}{y}\, \frac{1}{1 + 1/y}$.
By combining (46) and (47) we obtain
(48) $I(x) \approx \frac{\exp(x)}{x}$
for $|x| \gg 1$.
An arbitrary constant coming from the lower limit of the integral must also be added to the expression. We omit it here and in the following cases for simplicity.
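Since the antiderivative of $\exp(z)/z$ is the exponential integral $\operatorname{Ei}$, the asymptotics (48) can be checked numerically; a quick sketch (the sample points are arbitrary):

```python
import numpy as np
from scipy.special import expi   # Ei(x), the antiderivative of exp(z)/z

# Numeric check of Equation (48): I(x) ~ exp(x)/x for |x| >> 1. The ratios
# below slowly approach 1 as |x| grows (corrections are O(1/x)).
for x in (10.0, 30.0, -10.0, -30.0):
    print(x, expi(x) / (np.exp(x) / x))
```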
Please note that in the limit $|Dy| \ll |C|$, $|A| \gg |D/C|$, and for $B = \log|AC|$, we obtain an exponential function,
(49) $H(f(y)) = \exp(Ay).$
For $k \in [0, 1)$, Equations (37) and (38) can be combined as
(50) $\frac{d H(f(y))}{dy} = \frac{\operatorname{sign}(\dot{q}')\, (Ay + B)^{\alpha}}{Dy - C} = \operatorname{sign}(\dot{q}')\, \frac{A}{D}\, \frac{(Ay + B)^{\alpha}}{(Ay + B) - \left( B + \frac{AC}{D} \right)},$
where $\alpha \equiv 1/(1-k) \geq 1$. If $|Ay + B| \gg \left| B + \frac{AC}{D} \right|$, then
(51) $\frac{d H(f(y))}{dy} = \operatorname{sign}(\dot{q}')\, \frac{A}{D}\, \frac{(Ay + B)^{\alpha}}{Ay + B} = \operatorname{sign}(\dot{q}')\, \frac{A}{D}\, (Ay + B)^{\alpha - 1},$
whose solution is given by
(52) $H(f(y)) = \operatorname{sign}(\dot{q}')\, \frac{A^{\alpha}}{\alpha D} \left( y + \frac{B}{A} \right)^{\alpha}.$
Please note that for $\frac{A^{\alpha}}{\alpha D} = 1$ and $\Delta = \frac{B}{A}$ we obtain the following power-law dependence:
(53) $H(f(y)) = (y + \Delta)^{\alpha}.$
In the opposite limit, i.e., $|Ay + B| \ll \left| B + \frac{AC}{D} \right|$, we obtain
(54) $\frac{d H(f(y))}{dy} = -\operatorname{sign}(\dot{q}')\, \frac{A}{D}\, \frac{(Ay + B)^{\alpha}}{B + \frac{AC}{D}},$
or
(55) $H(f(y)) = -\operatorname{sign}(\dot{q}')\, \frac{A^{\alpha+1} \left( y + \frac{B}{A} \right)^{\alpha+1}}{(\alpha + 1)(AC + BD)}.$
For $\frac{A^{\alpha+1}}{(\alpha + 1)(AC + BD)} = 1$ and $\Delta = \frac{B}{A}$, the desired composition function takes the form
(56) $H(f(y)) = (y + \Delta)^{\alpha + 1}.$
For $k = 2$, i.e., $\alpha = 1/(1-k) = -1$, Equations (37) and (38) can be combined as
(57) $\frac{d H(f(y))}{dy} = \frac{\operatorname{sign}(\dot{q}')}{(Ay + B)(C - Dy)} = \frac{\operatorname{sign}(\dot{q}')}{-ADy^2 + (AC - BD)\,y + BC}.$
If the quadratic term in the denominator is small, i.e., $|ADy| \ll |AC - BD|$, then upon integration we obtain
(58) $H(f(y)) = \operatorname{sign}(\dot{q}')\, \frac{\log\big| (AC - BD)\, y + BC \big|}{AC - BD}.$
For $|AC - BD| = 1$ and $\Delta = BC$, we obtain the following form of the composition function:
(59) $H(f(y)) = -\log|\Delta - y|.$
As can be seen from Equations (49), (53), (56), and (59), exponential, power-law, and logarithmic compositions of the activation and loss functions are possible, and all of them, when the above-mentioned conditions are met, can lead to a power-law contribution from the Jacobian (see Equation (35)) to the distribution of fluctuations of the trainable variable $\dot{q}'$ under consideration.

7. Numerical Results

7.1. Toy Model

In this section, we shall present numerical results for the toy model described in Section 5 and compare them with the analytical results obtained in Section 6. In particular, we are interested in studying the emergence of critical behavior, i.e., when the distribution of changes in the transformed trainable variable $\dot{q}'$ is described by a power law,
(60) $p(\dot{q}') \propto (\dot{q}')^{-k}.$
The goal is to provide specific numerical examples of the critical behavior (35) over some ranges of scales for the exponential (49), power-law (56), and logarithmic (59) compositions of activation and loss functions $H(f(y))$.
For the numerical experiments, we consider a simple dataset with only two classes (called ‘−’ and ‘+’) with a single output neuron which takes discrete values ($x_2 = X_- = 0$ or $x_2 = X_+ = 1$). The input subspace also consists of only a single neuron that takes continuous values drawn from two Gaussian probability distributions, $p_-(x_1)$ (for $x_2 = 0$) and $p_+(x_1)$ (for $x_2 = 1$), with means
(61) $\langle x_1 \rangle_- = 0, \quad \langle x_1 \rangle_+ = 1,$
where the notations $\langle x_1 \rangle_-$ and $\langle x_1 \rangle_+$ mean averaging over the distributions of the values of the input neuron $x_1$ under the conditions $x_2 = X_-$ and $x_2 = X_+$, respectively. The standard deviations are chosen as follows:
(62) $\sqrt{\langle (x_1)^2 \rangle_- - \langle x_1 \rangle_-^2} = \sqrt{\langle (x_1)^2 \rangle_-} = 0.25, \quad \sqrt{\langle (x_1)^2 \rangle_+ - \langle x_1 \rangle_+^2} = \sqrt{\langle (x_1)^2 \rangle_+ - 1} = 0.25.$
This simple dataset allows us to use the architecture of the toy model of Section 5, with one input neuron $x_1$, one output neuron $x_2$, one trainable bias $b$, and one trainable weight $w$ (from the input neuron to the output neuron).
Please note that the main goal of this analysis is not to obtain high prediction accuracy (although it is above 97 % in all experiments), or to develop an architecture for a realistic classification problem (although a similar dataset would have been obtained if we considered a truncated MNIST dataset [37] with only images of ‘zeros’ and ‘ones’ and input states projected down to a single dimension). Rather, our aim is to demonstrate the emergence of criticality within a simple low-dimensional problem, but we expect the same mechanisms to be responsible for the emergence of criticality in higher-dimensional problems [35].
Also, note that in the learning equilibrium, the average value of the trainable parameters does not change significantly. Therefore, the distribution of the actual changes in the trainable variables, i.e., $p(\dot{q}')$, and the distribution of potential jumps when the trainable parameters remain fixed, i.e., $p\big(-\gamma\, \partial H / \partial q'\big)$, are essentially the same. However, the latter is more suitable: we will consider potential jumps caused separately by the ‘−’ and ‘+’ classes, which, in the learning equilibrium, can be described as two one-dimensional problems.
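For concreteness, here is a minimal sketch of this measurement protocol: generate the dataset (61) and (62), run stochastic gradient descent to a (near) equilibrium, and then histogram the potential jumps separately for each class. The learning rate, sample size, and seed below are illustrative assumptions, not the settings used for the figures.

```python
import numpy as np

# Sketch of the measurement protocol: train the one-weight, one-bias toy model
# (sigmoid + mean-squared loss) on the two-Gaussian dataset of Equations
# (61)-(62), then collect potential jumps -gamma dH/dq per class at fixed
# (equilibrium) parameters.
rng = np.random.default_rng(4)
n, gamma = 200_000, 0.5
x2 = rng.integers(0, 2, n).astype(float)   # class labels X- = 0, X+ = 1
x1 = rng.normal(x2, 0.25)                  # means 0 and 1, sd 0.25

w, b = rng.normal(), rng.normal()
sigmoid = lambda y: 1 / (1 + np.exp(-y))
for i in range(n):                         # SGD towards learning equilibrium
    f = sigmoid(w * x1[i] + b)
    dH_dy = 2 * (f - x2[i]) * f * (1 - f)  # H = (f - x2)^2
    w -= gamma * dH_dy * x1[i]
    b -= gamma * dH_dy

# potential jumps of the bias at fixed (w, b), split by class
f = sigmoid(w * x1 + b)
jumps = -gamma * 2 * (f - x2) * f * (1 - f)
for cls in (0.0, 1.0):
    j = np.abs(jumps[x2 == cls])
    hist, _ = np.histogram(j, bins=np.geomspace(1e-8, j.max(), 40),
                           density=True)   # a log-log plot reveals k
```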
In the remainder of the section, we shall consider five experiments, two for exponential (49), two for power-law (56), and one for logarithmic (59) composition of activation and loss functions.

7.1.1. Exponential Composition, k = 1

Consider the sigmoid activation function
(63) $f = (1 + e^{-y})^{-1}$
with the mean-squared loss
(64) $H(f, x_2) = (f - x_2)^2$
or the cross-entropy loss
(65) $H(f, x_2) = -(1 - x_2) \log(1 - f) - x_2 \log(f).$
For the composition of the sigmoid activation (63) and mean-squared loss (64) functions and the ‘−’ class, in the limit $y \ll 0$ Equation (63) reduces to $f \approx e^{y}$ and the composition function is
(66) $H(f(y)) \approx e^{2y}.$
Likewise, for the same composition and the ‘+’ class, in the limit $y \gg 0$ Equation (63) reduces to $f \approx 1 - e^{-y}$ and the composition function is
(67) $H(f(y)) \approx e^{-2y}.$
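A quick numeric check of the asymptotics (66) (the grid of points is arbitrary):

```python
import numpy as np

# Check of Equation (66): for the '-' class (x2 = 0), sigmoid activation with
# mean-squared loss gives H(f(y)) = f(y)^2 -> e^{2y} as y -> -infinity.
y = np.linspace(-10.0, -4.0, 7)
f = 1 / (1 + np.exp(-y))
print(np.max(np.abs(f**2 / np.exp(2 * y) - 1)))   # small, ~2 e^y at y = -4
```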
Below are the graphs showing the experimental dependencies used to compute the coefficients $A, B, C, D$ for both classes in this case. Knowing these coefficients, one can verify that the conditions required to obtain the exponential composition (49) are satisfied. Moreover, it is clear from the graphs in Figure 1 that the mapping between $y$ and $\dot{q}'$ is invertible for the chosen points, i.e., duality between these variables is observed. At the same time, the graphs in Figure 2 show that the linear transformation of variables $\dot{q} \rightarrow \dot{q}'$ provides a good approximation of the curve along which fluctuations actually occur. We also carried out a similar procedure to verify the relationships between the coefficients $A, B, C, D$ in subsequent experiments, selecting for analysis only those regions where true dataset-learning duality is realized.
In Figure 3, we plot on logarithmic axes the distributions of fluctuations of $\dot{q}'$ (blue dots) and the contribution from the Jacobian (red dots) to these distributions, corresponding to Equation (35). The linear behavior with slope −1 (or $k = 1$) is in good agreement with Equation (49).
For the composition of the sigmoid activation (63) and cross-entropy loss (65) functions we obtain, for the ‘−’ class and in the limit $y \ll 0$,
(68) $H(f(y)) = -\log\big(1 - f(y)\big) \approx e^{y},$
and for the ‘+’ class and in the limit $y \gg 0$,
(69) $H(f(y)) = -\log\big(f(y)\big) \approx e^{-y}.$
In Figure 4, we plot the distribution of fluctuations of $\dot{q}'$, which is once again in agreement with Equation (49). Please note that in this case, despite a nearly perfect power-law contribution from the Jacobian (red dots), the distribution of fluctuations as a whole (blue dots) is significantly distorted by the contribution from $p_y(y(\dot{q}'))$ in Equation (35).
We conclude that for the toy-model classification problem with a sigmoid activation function and either mean-squared error or cross-entropy loss functions, the fluctuations of trainable variables can follow a power law with k = 1 , as confirmed both numerically and analytically.

7.1.2. Power-Law Composition, k = 0; 2/3

Consider a composition of the ReLU activation function
(70) $f(y) = \max(0, y)$
and the power-law loss function
(71) $H(f, x_2) = (f - x_2)^n.$
In this case, non-vanishing fluctuations can take place only when the argument of the ReLU is positive, i.e.,
(72) $y > 0,$
and then the composition function is also a power law,
(73) $H(f(y)) = (y - x_2)^n.$
For $n = 2$, i.e., the standard mean-squared loss, Equation (56) implies that
(74) $k = \frac{\alpha - 1}{\alpha} = \frac{n - 2}{n - 1} = 0,$
which is in agreement with the numerical results plotted in Figure 5.
For $n = 4$, Equation (56) implies that $k = \frac{2}{3}$, in agreement with the numerical results plotted in Figure 6.
Thus, we have demonstrated that the power-law composition case (56) is achieved by combining power-law loss and ReLU activation functions. More generally, we can obtain any $k = \frac{n-2}{n-1}$, where $n$ is the power that appears in (73).
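This relation can be probed numerically without training: in the ReLU-active region, the jump magnitude scales as $y^{n-1}$, so the Jacobian contribution alone produces the predicted exponent at small scales. In the sketch below, setting $w = 1$, $b = 0$ and using a Gaussian input for the ‘−’ class are illustrative assumptions:

```python
import numpy as np

# Numeric check of k = (n-2)/(n-1) for the ReLU / power-law-loss composition:
# for the '-' class (x2 = 0) the jump magnitude scales as y^(n-1), so the
# Jacobian factor alone gives p(qdot) ~ qdot^{-(n-2)/(n-1)} at small qdot.
rng = np.random.default_rng(2)
n = 4                                         # power in the loss (f - x2)^n
y = np.abs(rng.normal(0.0, 0.5, 1_000_000))   # ReLU-active arguments y > 0
qdot = y**(n - 1)                             # jump magnitude for x2 = 0

bins = np.geomspace(1e-6, 1e-2, 30)           # scales where the power law holds
hist, _ = np.histogram(qdot, bins=bins, density=True)
centers = np.sqrt(bins[1:] * bins[:-1])
slope = np.polyfit(np.log(centers), np.log(hist), 1)[0]
print(slope, -(n - 2) / (n - 1))              # both close to -2/3
```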

7.1.3. Logarithmic Composition, k = 2

Consider the cross-entropy loss function (65) and a piece-wise linear activation function,
(75) $f(y) = \begin{cases} y, & 0 < y < 1, \\ 0, & y < 0 \ \text{or}\ y > 1. \end{cases}$
Within the range $0 < y < 1$, the composition of the activation and loss functions is
(76) $H(f(y)) = -(1 - x_2) \log(1 - y) - x_2 \log(y),$
which is the logarithmic composition of Equation (59) for the ‘−’ class (i.e., $x_2 = 0$) with $|\Delta| = 1$, and for the ‘+’ class (i.e., $x_2 = 1$) with $\Delta = 0$.
In Figure 7, we plot the distribution of fluctuations of $\dot{q}'$, for which $k = 2$, in good agreement with the analytical result (59).

7.2. Large Model

Next, we describe a procedure that reveals the power-law distribution of fluctuations in the trainable variables of a large neural network, even in the absence of a power-law distribution in the input data.
First, let us address the input data by excluding directions in the data space along which the values of the non-trainable variables vary weakly. To achieve this, we compute the covariance matrix $C_{xx}$, perform its spectral decomposition $C_{xx} = S A S^T$, and switch to the basis of its eigenvectors, where the covariance matrix becomes diagonal:
(77) $C_{xx} = E[x x^T] = S A S^T, \quad x' = S^T x, \quad C_{x'x'} = E[x' x'^T] = E[(S^T x)(S^T x)^T] = S^T C_{xx} S = A.$
Now, we define the transformed vector as $x' = S_{\rm cut}^T x$, where the matrix $S_{\rm cut}^T$ consists of the first $n$ rows of $S^T$. In other words, the components of $x'$ are the projections of the original vector $x$ onto the eigenvectors of the covariance matrix $C_{xx}$ corresponding to its $n$ largest eigenvalues. Thus, we work in the subspace of the input data most relevant for learning. In this way, one can effectively estimate the dimensionality of the input data space, which in practical problems is often much smaller than the total number of input neurons.
Next, we calculate the cross-covariance matrix $C_{x'\dot{q}}$ of the input data in the transformed variables $x'$ with the fluctuations of the trainable variables $\dot{q}$. After that, we perform the SVD decomposition $C_{x'\dot{q}} = U \Sigma V^T$ and move to the bases of the left and right singular vectors in the input space and in the space of trainable-variable fluctuations, respectively. In these new variables the cross-covariance matrix has nonzero elements only on the main diagonal:
(78) $C_{x'\dot{q}} = E[x' \dot{q}^T] = U \Sigma V^T, \quad x'' = U^T x', \quad \dot{q}' = V^T \dot{q}, \quad C_{x''\dot{q}'} = E[x'' \dot{q}'^T] = E[(U^T x')(V^T \dot{q})^T] = U^T C_{x'\dot{q}} V = \Sigma.$
In such variables, the dataset-learning duality becomes explicit due to the absence of correlations between the input data $x''$ and the fluctuations of the trainable parameters $\dot{q}'_i$ for $i > \dim(x'') = n$; i.e., making the transformation $q' = V_{\rm cut}^T q$, where $V_{\rm cut}^T$ consists of the first $n$ rows of the matrix $V^T$, we arrive at a vector $q'$ of size $n$ and will consider fluctuations along its components.
Note also that the input space $x''$ is actually mapped to a manifold that is not flat. However, using the cross-covariance matrix, we can obtain a linear approximation of this generally curved subspace in $\dot{q}$ and gain an understanding of where the input subspace is locally mapped.
Next, we consider the distribution along a chosen direction in the input space and the distribution along the correlated direction in the space of trainable-variable fluctuations. For this, we select the components $(x'')_1$ and $(\dot{q}')_1$ along the directions of the left and right singular vectors of the cross-covariance matrix in the respective spaces, which correspond to the largest singular value of $C_{x'\dot{q}}$.
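The procedure above is straightforward to express in code. The following is a compact sketch under the assumption that matched samples of inputs and fluctuations have already been collected; the function and variable names are hypothetical, not taken from the authors' implementation:

```python
import numpy as np

# Sketch of the large-model procedure: (i) PCA-truncate the inputs to the n
# leading directions of C_xx, (ii) SVD the cross-covariance with the parameter
# fluctuations, (iii) read off the paired directions (x'')_1 and (qdot')_1.
def paired_directions(X, Qdot, n):
    """X: (T, N_in) input samples; Qdot: (T, K) fluctuation samples."""
    Xc = X - X.mean(axis=0)
    Qc = Qdot - Qdot.mean(axis=0)

    # step 1: spectral decomposition of C_xx, truncation to top-n eigenvectors
    C_xx = Xc.T @ Xc / len(Xc)
    eigval, S = np.linalg.eigh(C_xx)
    S_cut = S[:, np.argsort(eigval)[::-1][:n]]
    Xp = Xc @ S_cut                         # x' = S_cut^T x

    # step 2: SVD of the cross-covariance C_x'qdot
    C_xq = Xp.T @ Qc / len(Xp)
    U, sigma, Vt = np.linalg.svd(C_xq, full_matrices=False)

    # step 3: components paired by the largest singular value
    x_dir = Xp @ U[:, 0]                    # samples of (x'')_1
    q_dir = Qc @ Vt[0]                      # samples of (qdot')_1
    return x_dir, q_dir, sigma

# histograms of x_dir and q_dir correspond to the panels of Figures 8-11
```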

7.2.1. UNN Model

Now let us consider a large fully unconstrained neural network (see Section 2), consisting of 100 input neurons $x_{\rm inp}$ (hereafter, we omit the subscript and denote them simply as $x$) and 10 output neurons $x_{\rm out}$. The network is fully connected, meaning that every neuron is connected to every other neuron. Please note that there are no bulk (hidden) neurons; instead, both the input and output neurons are updated during the activation dynamics, as they are mutually connected and do not remain fixed. After all steps of the activation dynamics (in our example, we used five steps), the values of the output neurons $x_{\rm out}$ are compared with the correct target values for training. The input data $x$ consist of the 100 pixels with the greatest variance across the entire dataset. Thus, the neural network with this architecture is trained to classify images of handwritten digits. Next, the distributions of fluctuations in the local learning equilibrium are analyzed according to the procedure described above.
It can be seen in Figure 8 and Figure 9 that the distribution along $(x'')_1$ has a form close to a Gaussian distribution, while the distribution of fluctuations along $(\dot{q}')_1$ has an explicit power-law form over a wide range of scales. In Figure 8, the hyperbolic tangent was used as the activation function for all neurons and the cross-entropy form of the loss function was chosen, while in Figure 9, the ReLU activation function was used everywhere and the mean-squared loss function was applied.

7.2.2. CNN Model

Now, let us consider another example: a convolutional neural network, which consists of two convolutional layers and two fully connected layers. The total number of trainable variables is 10,000. This neural network, as in the other examples, is used to classify the MNIST dataset. The values of all available pixels of each digit image, $N_\partial = 28^2$, are fed to the boundary neurons as input data.
Distributions of input data and fluctuations of the trainable variables in the local learning equilibrium for the CNN are shown in Figure 10 and Figure 11. The results are similar to those obtained for the UNN.
It is worth noting that we have already obtained such distribution powers, $k \approx 1$ and $k \approx 0$, analytically and experimentally for the toy model. Interestingly, the experimental data suggest that the presence of criticality and the value of the distribution power do not depend on the size of the neural network. When using the same loss and activation functions in both small and large neural networks, we observed similar powers for the fluctuation distributions at learning equilibrium, namely $k \approx 1$ (Figure 4, Figure 8 and Figure 10) and $k \approx 0$ (Figure 5, Figure 9 and Figure 11). In all considered models, the power-law distributions of the trainable-variable jumps do not originate from the input data distributions; rather, their emergence is driven by the specific forms of the activation and loss functions.

8. Discussion

In this paper, we accomplished two main tasks.
Firstly, we established a duality mapping between the space of boundary neurons (i.e., the dataset) and the tangent space of trainable variables (i.e., the learning), the so-called dataset-learning duality. Both spaces (the boundary and the tangent) are generally very high-dimensional, making the analysis of the duality very non-trivial. However, by considering the problem in a local learning equilibrium and under the assumption that the probability distribution function of the tangent space variables is factorizable, the multidimensional problem can be greatly simplified.
Secondly, we applied the dataset-learning duality to study the emergence of criticality in learning systems. We showed that the observed scale-invariance of fluctuations of the trainable variables (e.g., weights and biases) is caused by the emergence of criticality in the dual tangent space of trainable variables. In particular, we analyzed different compositions of activation and loss functions which can give rise to a power-law distribution of fluctuations of the trainable variables over a wide range of scales for a toy model. We showed that power-law, exponential, and logarithmic compositions of activation and loss functions can all give rise to criticality in the learning dynamics, even if the dataset is in a non-critical state. The main results of the study of criticality are summarized in the following table.

k | H(f(y)) | Examples
1 | $\exp(Ay)$ | Cross-entropy or mean-squared loss functions and sigmoid activation function
$\frac{\alpha - 1}{\alpha}$ | $(y + \Delta)^{\alpha + 1}$ | Power-law loss and ReLU activation functions
2 | $-\log|\Delta - y|$ | Cross-entropy loss and piece-wise linear activation functions
In addition, we conducted an experiment with a large model classifying digit images from the MNIST dataset. We found that by using the same activation and loss functions as in the toy model, the large neural network exhibits the same power-law behaviors. Thus, we have empirically demonstrated that the results obtained from the toy model generalize to real neural networks. The emergence of criticality and the specific exponents of the trainable variables’ fluctuation distributions depend not on the network size, but on the choice of certain loss and activation functions.
Besides its theoretical significance and potential relevance in modeling critical phenomena in physical and biological systems [18,19], the emergence of criticality is expected to play a central role in machine learning applications. Power-law distributions, in particular, enable trainable variables to explore a broader range of scales without facing exponential suppression. Consequently, criticality is presumed to prevent neural networks from becoming trapped in local minima, a highly desirable property for any learning system. However, the analysis of the learning efficiency and its relation to criticality must involve considerations of non-equilibrium systems, which are beyond the scope of the current paper. Nevertheless, we expect that the established dataset-learning duality can be developed further to shed light on non-equilibrium problems. We leave these and other related questions for future research.

Author Contributions

Conceptualization, V.V.; methodology E.K. and V.V.; software, E.K.; validation, E.K. and V.V.; formal analysis E.K. and V.V.; investigation, E.K. and V.V.; resources, E.K. and V.V.; data curation, E.K.; writing—original draft preparation, E.K. and V.V.; writing—review and editing, E.K. and V.V.; visualization, E.K.; supervision, V.V.; project administration, E.K. and V.V.; funding acquisition, V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author E.K.

Acknowledgments

The authors are grateful to Yaroslav Gusev for his invaluable assistance with both numerical and analytical problems. His expertise was crucial to the success of this project.

Conflicts of Interest

Authors Ekaterina Kukleva and Vitaly Vanchurin were employed by the company Artificial Neural Computing. Author Vitaly Vanchurin was employed by the company Duluth Institute for Advanced Study. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Griffiths, D.J. Introduction to Electrodynamics, 3rd ed.; Pearson: London, UK, 1999. [Google Scholar]
  2. Feynman, R.P.; Leighton, R.B.; Sands, M. The Feynman Lectures on Physics Vol. III: Quantum Mechanics; Basic Books: New York, NY, USA, 2011. [Google Scholar]
  3. Polchinski, J. String Theory: Volume II, Superstring Theory and Beyond; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
  4. Kramers, H.A.; Wannier, G.H. Statistics of the Two-Dimensional Ferromagnet. Part I. Phys. Rev. 1941, 60, 252–262. [Google Scholar] [CrossRef]
  5. Kramers, H.A.; Wannier, G.H. Statistics of the Two-Dimensional Ferromagnet. Part II. Phys. Rev. 1941, 60, 263–276. [Google Scholar] [CrossRef]
  6. Vanchurin, V. A quantum-classical duality and emergent space-time. arXiv 2019, arXiv:1903.06083. [Google Scholar] [CrossRef]
  7. Vanchurin, V. Dual Path Integral: A non-perturbative approach to strong coupling. Eur. Phys. J. C 2021, 81, 235. [Google Scholar] [CrossRef]
  8. Vanchurin, V. Differential equation for partition functions and a duality pseudo-forest. J. Math. Phys. 2022, 63, 073501. [Google Scholar] [CrossRef]
  9. Susskind, L. The world as a hologram. J. Math. Phys. 1995, 36, 6377–6396. [Google Scholar] [CrossRef]
  10. ’t Hooft, G. Dimensional Reduction in Quantum Gravity. arXiv 1993, arXiv:gr-qc/9310026. [Google Scholar]
  11. Maldacena, J. The Large-N Limit of Superconformal Field Theories and Supergravity. Int. J. Theor. Phys. 1999, 38, 1113–1133. [Google Scholar] [CrossRef]
  12. Witten, E. Anti de Sitter Space and Holography. Adv. Theor. Math. Phys. 1998, 2, 253–291. [Google Scholar] [CrossRef]
  13. Witten, E. Anti-de Sitter Space, Thermal Phase Transition, and Confinement in Gauge Theories. Adv. Theor. Math. Phys. 1998, 2, 505–532. [Google Scholar] [CrossRef]
  14. Gubser, S.S.; Klebanov, I.R.; Polyakov, A.M. Gauge Theory Correlators from Non-Critical String Theory. Phys. Lett. B 1998, 428, 105–114. [Google Scholar] [CrossRef]
  15. Axler, S. Linear Algebra Done Right, 2nd ed.; Springer: New York, NY, USA, 1997. [Google Scholar]
  16. Kagan, V.F. (Ed.) Projective Geometry; Dover Publications: New York, NY, USA, 1952. [Google Scholar]
  17. Cartan, H.; Eilenberg, S. Homological Algebra; Princeton University Press: Princeton, NJ, USA, 1956. [Google Scholar]
  18. Landau, L.D.; Lifshitz, E.M. Statistical Physics. In Course of Theoretical Physics, 2nd ed.; Pergamon Press: Oxford, UK, 1969; Volume 5. [Google Scholar]
  19. Romanenko, A.; Vanchurin, V. Quasi-Equilibrium States and Phase Transitions in Biological Evolution. Entropy 2024, 26, 201. [Google Scholar] [CrossRef] [PubMed]
  20. Vanchurin, V. The World as a Neural Network. Entropy 2020, 22, 1210. [Google Scholar] [CrossRef] [PubMed]
  21. Katsnelson, M.; Vanchurin, V. Emergent Quantumness in Neural Networks. Found. Phys. 2021, 51, 94. [Google Scholar] [CrossRef]
  22. Vanchurin, V. Towards a Theory of Quantum Gravity from Neural Networks. Entropy 2022, 24, 7. [Google Scholar] [CrossRef]
  23. Vanchurin, V.; Wolf, Y.I.; Koonin, E.V.; Katsnelson, M.I. Thermodynamics of evolution and the origin of life. Proc. Natl. Acad. Sci. USA 2022, 119, e2120042119. [Google Scholar] [CrossRef]
  24. Vanchurin, V.; Wolf, Y.I.; Katsnelson, M.I.; Koonin, E.V. Toward a theory of evolution as multilevel learning. Proc. Natl. Acad. Sci. USA 2022, 119, e2120037119. [Google Scholar] [CrossRef]
  25. Rohlf, T.; Bornholdt, S. Self-organized criticality and adaptation in discrete dynamical networks. Phys. Rev. Lett. 2002, 88, 228701. [Google Scholar] [CrossRef]
  26. Rybarsch, M.; Bornholdt, S. Self-organized criticality in neural network models. Phys. Rev. E 2012, 86, 026114. [Google Scholar] [CrossRef]
  27. Landmann, S.; Baumgarten, L.; Bornholdt, S. Self-organized criticality in neural networks from activity-based rewiring. Phys. Rev. E 2021, 103, 032304. [Google Scholar] [CrossRef]
  28. Papa, B.D.; Priesemann, V.; Triesch, J. Criticality meets learning: Criticality signatures in a self-organizing recurrent neural network. PLoS ONE 2017, 12, e0178683. [Google Scholar] [CrossRef]
  29. Kossio, F.Y.K.; Goedeke, S.; van den Akker, B.; Ibarz, B.; Memmesheimer, R.M. Growing Critical: Self-Organized Criticality in a Developing Neural System. Phys. Rev. Lett. 2018, 121, 058301. [Google Scholar] [CrossRef] [PubMed]
  30. Cowan, J.D.; Neuman, J.; van Drongelen, W. Self-organized criticality in a network of interacting neurons. J. Stat. Mech. Theory Exp. 2012, 4, P04030. [Google Scholar] [CrossRef]
  31. Mahoney, M.; Martin, C. Traditional and Heavy-Tailed Self Regularization in Neural Network Models. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 4284–4293. [Google Scholar] [CrossRef]
  32. Martin, C.H.; Mahoney, M.W. Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks. In Proceedings of the 2020 SIAM International Conference on Data Mining, Cincinnati, OH, USA, 7–9 May 2020; pp. 505–513. [Google Scholar] [CrossRef]
  33. Mei, S.; Montanari, A.; Nguyen, P.-M. A Mean Field View of the Landscape of Two-Layers Neural Networks. Proc. Natl. Acad. Sci. USA 2018, 115, E7665–E7671. [Google Scholar] [CrossRef]
  34. Benigni, L.; Péché, S. Eigenvalue distribution of some nonlinear models of random matrices. Electron. J. Probab. 2021, 26, 1–37. [Google Scholar] [CrossRef]
  35. Katsnelson, M.I.; Vanchurin, V.; Westerhout, T. Emergent scale invariance in neural networks. Phys. A Stat. Mech. Its Appl. 2023, 610, 128401. [Google Scholar] [CrossRef]
  36. Vanchurin, V. Towards a theory of machine learning. Mach. Learn. Sci. Technol. 2021, 2, 035012. [Google Scholar] [CrossRef]
  37. LeCun, Y.; Cortes, C.; Burges, C. MNIST Handwritten Digit Database; AT&T Labs: Online, 2010. Available online: http://yann.lecun.com/exdb/mnist (accessed on 16 September 2025).
  38. Ivakhnenko, A.; Lapa, V. Cybernetics and Forecasting Techniques; Modern Analytic and Computational Methods in Science and Mathematics; American Elsevier Publishing Company: New York, NY, USA, 1967. [Google Scholar]
  39. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  40. Kramer, M.A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243. [Google Scholar] [CrossRef]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
Figure 1. Dependencies for determining coefficients A and B. (a) ‘−’ class. (b) ‘+’ class.
Figure 2. Dependencies for determining coefficients C and D. (a) ‘−’ class. (b) ‘+’ class.
Figure 3. Distribution of fluctuations for composition of sigmoid activation and mean-squared loss functions. (a) ‘−’ class. (b) ‘+’ class.
Figure 4. Distribution of fluctuations for composition of sigmoid activation and cross-entropy loss functions. (a) ‘−’ class. (b) ‘+’ class.
Figure 5. Distribution of fluctuations for composition of ReLU activation and mean-squared loss functions. (a) ‘−’ class. (b) ‘+’ class.
Figure 6. Distribution of fluctuations for composition of ReLU activation and power-law loss function of the fourth power. (a) ‘−’ class. (b) ‘+’ class.
Figure 7. Distribution of fluctuations for composition of piece-wise linear activation and cross-entropy loss functions. (a) ‘−’ class. (b) ‘+’ class.
Figure 8. Results of the experiment with the choice of hyperbolic tangent activation and cross-entropy loss functions for UNN. The graphs show distributions of linearly correlated directions in two spaces, which correspond to the largest singular value of the cross-covariance matrix. (a) Distribution along the direction in the input data space. (b) Distribution of fluctuations of the trainable parameter on logarithmic axes.
Figure 9. Results of the experiment with the choice of ReLU activation and mean-squared loss functions for UNN. The graphs show distributions of linearly correlated directions in two spaces, which correspond to the largest singular value of the cross-covariance matrix. (a) Distribution along the direction in the input data space. (b) Distribution of fluctuations of the trainable parameter on logarithmic axes.
Figure 10. Results of the experiment with sigmoid activation and cross-entropy loss functions for the CNN, showing distributions of linearly correlated directions in two spaces corresponding to the largest singular value of the cross-covariance matrix. (a) Distribution along the direction in the input data space. (b) Distribution of fluctuations of the trainable parameter.
Figure 11. Results of the experiment with ReLU activation and mean-squared loss functions for CNN, showing distributions of linearly correlated directions in two spaces corresponding to the largest singular value of the cross-covariance matrix. (a) Distribution along the direction in the input data space. (b) Distribution of fluctuations of the trainable parameter on logarithmic axes.