Effect of the initial configuration of weights on the training and function of artificial neural networks

The function and performance of neural networks are largely determined by the evolution of their weights and biases in the process of training, starting from the initial configuration of these parameters and ending at one of the local minima of the loss function. We perform a quantitative statistical characterization of the deviation of the weights of two-hidden-layer ReLU networks of various sizes trained via Stochastic Gradient Descent (SGD) from their initial random configuration. We compare the evolution of the distribution function of this deviation with the evolution of the loss during training. We observed that successful training via SGD leaves the network in the close neighborhood of the initial configuration of its weights. For each initial weight of a link we measured the distribution function of the deviation from this value after training and found how the moments of this distribution and its peak depend on the initial weight. We explored the evolution of these deviations during training and observed an abrupt increase within the overfitting region. This jump occurs simultaneously with a similarly abrupt increase recorded in the evolution of the loss function. Our results suggest that SGD's ability to efficiently find local minima is restricted to the vicinity of the random initial configuration of weights.


Introduction
Training of neural networks is based on the progressive correction of their weights and biases (model parameters) performed by algorithms such as gradient descent, which compare actual outputs with the desired ones for a large set of input samples (LeCun et al., 2015). Consequently, the understanding of the function of neural networks should be intrinsically based on the detailed knowledge of the evolution of their weights in the process of training, starting from their initial configuration. Recently, Li & Liang (2018) revealed that, during training, weights in neural networks only slightly deviate from their initial values in most practical scenarios. In this paper, we explore in detail the role of the initial configuration of neural networks' weights in their training and function. We scan the evolution of the weights of networks consisting of two ReLU hidden layers trained on three different classification tasks with Stochastic Gradient Descent (SGD), and measure the dependence of the distribution of deviations from an initial weight on this initial value. In all of our experiments, we observe no inconsistencies among the results of the three tasks.
By considering networks of different sizes we observe that, in order to reach an arbitrarily chosen loss value, the weights of larger networks tend to deviate less from their initial values (on average) than the weights of smaller networks. This suggests that larger networks tend to converge to minima which are closer to their initialization. On the other hand, we observe that for a certain range of network sizes the deviations from initial weights abruptly increase at some moment during their training within the overfitting regime, see Fig. 1. We find that this sharp increase closely correlates with the crossover between two regimes of the network (trainability and untrainability) occurring in the course of the training. Finally, we measure the dependence of the time at which these crossovers happen on the network size and build a diagram showing the network's training regime for each network width and training time. This diagram, Fig. 2 (c), resembles a phase diagram, although the variable t, the training time, is not a control parameter, but rather a measure of the duration of the 'relaxation' process that SGD training represents.
One may speak about a phase transition in these systems only with respect to their stationary state, that is, the regime in which they finally end up, after being trained for a very long (infinite) time. Figure 2 (c) shows three characteristic instants (times) of training for each network width: (i) the time at which the minimum of the test loss occurs, (ii) the time of the minimum of the train loss, (iii) the time at which the loss abruptly increases ('diverges'). Each of these times differs between runs, and, for some widths, these fluctuations are strong or they even diverge. The points in this plot are the average values over ten independent runs. In the error bars we show the scale of the fluctuations between different runs. Notice that the times (ii) and (iii) approach infinity as we approach the threshold of about 300 nodes from below (which is specific to the networks' architecture and dataset). Therefore, wide networks (more than about 300 nodes in each hidden layer) never cross over to the untrainability regime; wide networks should stabilize in the trainability regime as t → ∞.

Figure 1: Train and test loss of networks consisting of two equally sized hidden layers of nodes, trained on HASYv2. Some of the weights connecting the two hidden layers were initially set to zero so that the weight matrix of these layers resembles the letter a (at initialization). The evolution of the weight matrix is shown in the subplots. (a) Loss of a stable learner network with 512 nodes in each hidden layer. (b) Training loss of an unstable network with 256 nodes in each hidden layer, illustrating the effect of crossing over from trainability to untrainability regimes on the network's weights (disappearance of the initialization mark). A single curve is shown for clarity, but networks of the same width show the same behavior. Experiments with other symbols exhibit similar behavior.
The untrainability region of the diagram exists only for widths smaller than the threshold, which is in the neighborhood of 300 nodes. Networks with such widths initially demonstrate a consistent decrease of the train loss. But, at some later moment, during the training process, the systems abruptly cross over from the trainability regime, with small deviations of weights from their initial values and decreasing train loss, to the untrainability regime, with large loss and large deviations of weights.
By gradually reducing the width, and looking at the trainability regime in the limit of infinite training time, we find a phase transition from trainable to untrainable networks. In the diagram of Fig. 2 (c), this transition corresponds to a horizontal line at t = ∞, or, equivalently, the projection of the diagram on the horizontal axis (notice that the border between regimes is concave). Note that we use the terms trainability/untrainability to refer to the regimes of the training process in which loss and deviations of weights are, respectively, small/large. We reserve the term trainable-untrainable transition for the capability of a network to keep a low train loss after infinite training, which depends essentially on the network's architecture.
Our paper is organized as follows. In Sec. 2 we summarize some background topics on neural networks' initializations, and review a series of recent papers pertaining to ours. Section 3 presents the experimental settings and datasets used in this paper. In Sec. 4 we explore the shape of the distribution of the deviations of weights from their initial values and its dependence on the initial weights. We continue these studies in Sec. 5 by experimenting with networks of different widths and find that, whenever a network's training is successful, the network does not travel far from its initial configuration. Finally, Sec. 6 provides concluding remarks and points out directions for future research.

Previous works
It is widely known that a neural network's initialization is instrumental in its training (LeCun et al., 1998b; Yam & Chow, 2000; Glorot & Bengio, 2010; He et al., 2015). The works of Glorot & Bengio (2010), Chapelle & Erhan (2011) and Krizhevsky et al. (2012), for instance, showed that deep networks initialized with random weights and optimized with methods as simple as Stochastic Gradient Descent could, surprisingly, be trained successfully. In fact, by combining momentum with a well-chosen random initialization strategy, Sutskever et al. (2013) achieve performance comparable to that of Hessian-free methods.

Figure 2: (c) Average times (in epochs) taken by the networks to reach the minima of the train loss and of the test loss functions, and to diverge (i.e., reach the plateau of the train loss). For each network width, we calculate the averages and standard deviations of these times (represented by error bars) over ten independent runs trained on HASYv2. (d) Average values of minimum loss at the train and test sets reached during individual runs (different runs reach their minima at different times). These averages were measured over the same ten independent realizations as in panel (c).

There are many methods to randomly initialize a network. Usually, they consist of drawing the initial weights of the network from uniform or Gaussian distributions centered at zero, and setting the biases to zero or some other small constant. While the choice of the distribution (uniform or Gaussian) does not seem to be particularly important (Goodfellow et al., 2016), the scale of the distribution from which the initial weights are drawn does. The most common initialization strategies (those of Glorot & Bengio (2010), He et al. (2015), and LeCun et al. (1998b)) define rules based on the network's architecture for choosing the variance that the distribution of initial weights should have. These and other frequently used initialization strategies are mainly heuristic, seeking to achieve some desired properties at least during the first few iterations of training. However, it is generally unclear which properties are kept during training or how they vanish (Goodfellow et al., 2016, Sec. 8.4). Moreover, it is also not clear why some initializations are better from the point of view of optimization (i.e., achieve lower training loss), but are simultaneously worse from the point of view of generalization.
Frankle & Carbin (2019) recently observed that randomly initialized dense neural networks typically contain subnetworks (called winning tickets) that are capable of matching the test accuracy of the original network when trained for the same amount of time in isolation. Based on this observation they formulate the Lottery Ticket Hypothesis, which essentially states that this effect is general and manifests with high probability in this kind of network. Notably, these subnetworks are part of the network's initialization, as opposed to an organization that emerges throughout training. The subsequent works of Zhou et al. (2019) and Ramanujan et al. (2019) corroborate the Lottery Ticket Hypothesis and propose that winning tickets may not even require training to achieve quality comparable to that of the trained networks.
In their recent paper, Li & Liang (2018) established that two-layer over-parameterized ReLU networks, optimized with SGD on data drawn from a mixture of well-separated distributions, probably converge to a minimum close to their random initializations. Around the same time, Jacot et al. (2018) proposed the neural tangent kernel (NTK), a kernel that characterizes the dynamics of the training process of neural networks in the so-called infinite-width limit. These works instigated a series of theoretical breakthroughs, such as the proof that SGD can find global minima under conditions commonly found in practice (e.g., over-parameterization) (Du et al., 2019a,b; Allen-Zhu et al., 2019a,b,c; Oymak & Soltanolkotabi, 2019; Oymak & Soltanolkotabi, 2020; Zou et al., 2020), and that, in the infinite-width limit, neural networks remain in an O(1/√n) neighborhood of their random initialization (n being the width of the hidden layers) (Arora et al., 2019a,b). Lee et al. (2019) make a similar claim about the distance by which a network may deviate from its linearized version. Chizat et al. (2019), however, argue that such wide networks operate in a regime of "lazy training" that appears to be incompatible with the many successes neural networks are known for in difficult, high dimensional tasks.

Our contribution
From distinct perspectives, these previous works have shown that, in highly over-parametrized networks, the training process consists of fine-tuning the initial configuration of weights, adjusting significantly just a small portion of them (the ones belonging to the winning tickets). Furthermore, Frankle et al. (2020) recently showed that the winning ticket's weights are highly correlated with each other. The effects of having a winning ticket are visible in Fig. 3. This figure demonstrates the highly structured correlation between the weights before and after training, including that most of them are left essentially unchanged by the training. Our results suggest that the role of the over-parametrized initial configuration is actually decisive in successful training: when we reduce the level of over-parametrization to a point where the initial configuration stops containing such winning tickets, the network becomes untrainable by SGD.
The previous investigations on the role of the initial weights configuration focus on large networks, in which, as our results also show, the persistence of the initial configuration is more noticeable. In contrast, we explore a wide range of network sizes from untrainable to trainable by varying the number of units in the hidden layers. This approach allows us to explore the limits of trainability, and characterize the trainable-untrainable network transition that occurs at a certain threshold of the width of hidden layers, see Fig. 2.
A few recent works (Lu et al., 2017; [...]) indicated the existence of 'phase transitions' from narrow to wide networks associated with qualitative changes in the set of loss minima in the configuration space. These results resonate with ours, although neither relations to trainability nor the role of the initial configuration of weights were explored. On one hand, we observe that, when the networks are trainable (large networks), they always converge to a minimum in the vicinity of the initial weight configuration. On the other hand, when the network is untrainable (small networks) the weight configuration drifts away from the initial one. There is an intermediate size range for which the networks train reasonably well for a while, decreasing the loss consistently, but, later in the overfitting region, their loss abruptly increases (due to overshooting). Past this point of divergence, the loss can no longer be reduced by more training. The behavior of these ultimately untrainable networks further emphasizes the connection between trainability (ability to reduce train loss) and proximity to the initial configuration: the distance to the initial configuration remains small in the first stage of training, while the loss is reduced, and later increases abruptly, simultaneously with the loss.
We hypothesize that networks initialized with random weights and trained with SGD can only find good minima in the vicinity of the initial configuration of weights. The training process has no ability to effectively explore more than a relatively small region of the configuration space around the initial point.

Datasets and experimental settings
Throughout this paper we use three datasets to train our networks: MNIST, Fashion MNIST, and HASYv2. Figure 4 displays samples of them.
MNIST (LeCun et al., 1998a) is a database of grayscale handwritten digits. It consists of 60 000 training and 10 000 test images of size 28×28, each showing one of the numerals 0 to 9. It was chosen due to its popularity and widespread familiarity.
Fashion MNIST (Xiao et al., 2017) is a dataset intended to be a drop-in replacement for the original MNIST dataset for machine learning experiments. It features 10 classes of clothing categories (e.g., coat, shirt, etc) and it is otherwise very similar to MNIST. It also consists of 28×28 gray-scale images, 60 000 samples for training, and 10 000 for testing.

Figure 3: Distribution of the final values of the weights of the network of Fig. 1 (a), trained for 1000 epochs on HASYv2, as a function of their initial value. The peak of the distribution is at w_f = w_i, which is extremely close to the median. The skewness of the distribution for large absolute values of w_i is evidenced in the histograms at the top.

HASYv2 (Thoma, 2017) is a dataset of 32×32 binary images of handwritten symbols (mostly LaTeX symbols, such as α, σ, etc). It mainly differs from the previous two datasets in that it has many more classes (369) and is much larger (containing around 150 000 train and 17 000 test images).
We trained feedforward neural networks with two layers of hidden nodes (three layers of links), each containing between 10 and 1000 units. The two hidden layers employ the ReLU activation function, whereas the output layer employs softmax. This architecture is largely based on the multilayer perceptron created by Keras for MNIST. Unless otherwise stated, the biases of the networks are initialized at zero and the weights are initialized with Glorot's uniform initialization (Glorot & Bengio, 2010):

$$ w_{ij} \sim U\left(-\sqrt{\frac{6}{m+n}},\ \sqrt{\frac{6}{m+n}}\right), \qquad (1) $$

where U(−x, x) is the uniform distribution in the interval (−x, x), and m and n are the number of units of the layers that weight w_{ij} connects. In some of our experiments we apply various masks to these uniformly distributed weights, setting to zero all weights w_{ij} uncovered by a mask (see Fig. 1). The loss function minimized is the categorical cross-entropy, i.e.,

$$ L = -\sum_{i=1}^{C} y_i \ln o_i, \qquad (2) $$

where C is the number of output classes, y_i ∈ {0, 1} the i-th target output, and o_i the i-th output of the network. The neural networks were optimized with Stochastic Gradient Descent with a learning rate of 0.1 and in minibatches of size 128. The networks were defined and trained in Keras (Chollet et al., 2015) using its TensorFlow (Abadi et al., 2015) back-end.
We typically trained networks for very long periods (1000 epochs), and consequently for most of their training the networks were in the overfitting regime. However, since we are studying the training process of these networks and making no claims concerning the networks' ability to generalize on different data, overfitting does not affect our conclusions. In fact, our results are usually even stronger prior to overfitting.
For similar reasons, we will be considering only the loss function of the networks (and not other metrics such as their accuracy), since it is the loss function that the networks are optimizing.
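As an illustration of the initialization rule and loss described in this section, the following is a minimal sketch in plain Python (not the Keras code used for the experiments). The function names `glorot_uniform` and `categorical_cross_entropy`, the seed, the layer sizes, and the example outputs are assumptions made for the example only.

```python
import math
import random


def glorot_uniform(m, n, rng):
    """Draw an m-by-n weight matrix from U(-x, x) with x = sqrt(6 / (m + n))."""
    limit = math.sqrt(6.0 / (m + n))
    return [[rng.uniform(-limit, limit) for _ in range(n)] for _ in range(m)]


def categorical_cross_entropy(targets, outputs, eps=1e-12):
    """Return -sum_i y_i * ln(o_i) for a one-hot target and softmax outputs.

    The clipping constant eps is an assumption that avoids log(0).
    """
    return -sum(y * math.log(max(o, eps)) for y, o in zip(targets, outputs))


rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
weights = glorot_uniform(256, 256, rng)  # links between two 256-unit layers
limit = math.sqrt(6.0 / (256 + 256))

# Hypothetical three-class example: the true class is the second one.
loss = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

With a one-hot target, only the output assigned to the true class contributes to the loss, so the example value equals −ln 0.7.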

Statistics of deviations of weights from initial values
To illustrate the reduced scale of the deviations of weights during the training, let us mark a network's initial configuration of weights using a mask in the shape of a letter, and observe how the marking evolves as the network is trained. Naturally, if the mark is still visible after the weights undergo a large number of updates and the networks converge, one may conclude that the training process does not shift the weights of a network far from their initial state.
Figure 1 (a) shows typical results of training a large network whose initial configuration is marked with the letter a. One can see that the letter is clearly visible after training for as many as 1000 epochs. In fact, one observes the initial mark during all of the network's training, without any sign that it will disappear. Even more surprisingly, these marks do not affect the quality of training. Independently of the shape marked (or whether there is a mark or not) the network trains to approximately the same loss across different realizations of initial weights. This demonstrates that randomly initialized networks preserve features of their initial configuration along their whole training, and that these features are ultimately transferred into the networks' final applications. Figure 1 (b) demonstrates an opposite effect for midsize networks that cross over between the regimes of trainability and untrainability. As it illustrates, the initial configuration of weights of these unstable networks tends to be progressively lost, suffering the largest changes when the networks diverge (the loss function sharply increases at some moment).
By inspecting the distribution of the final (i.e., last, after training) values of the weights of the network of Fig. 1 (a) versus their initial values, portrayed in Fig. 3, we see that weights that start with larger absolute values are more likely to suffer larger updates (in the direction that their sign points to). This tendency can be observed on the plot by the tilt of the interquartile range (yellow region in the middle) with respect to the median (dotted line). The figure demonstrates that weights that are initially large in absolute value have a tendency to become even larger, keeping their original sign; it also shows the maximum concentration of weights near the line w_f = w_i, indicating that most weights change very little or not at all throughout training.
The skewness in the distribution of the final weights can be explained by the randomness of the initial configuration, which initializes certain groups of weights with more appropriate values than the others, making a few weights better suited to certain features of the dataset that is being used for training the network. This subset of weights does not need to be particularly good, but as long as it provides slightly better or more consistent outputs than the rest of the weights, the learning process will favor their training, improving them further in comparison to the remaining weights. Over the course of many epochs, the small preference that the learning algorithm keeps giving them adds up and causes these weights to become the best recognizers for the features that they initially, by chance, happened to be better at. This effect bears some resemblance to, for instance, the real-life effect of the month of birth in sports (Helsen et al., 2005). Under this hypothesis, it is highly likely that weights with larger initial values are more prone to be deemed more important by the learning algorithm, which will try to amplify their 'signal'.
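The marking procedure can be sketched as follows, assuming the mask is a boolean matrix in which `False` marks the weights to be set to zero. The helper name `apply_mask`, the 4×4 size, and the mark itself are hypothetical; the experiments use a letter-shaped mark on a much larger weight matrix.

```python
import math
import random


def apply_mask(weights, keep):
    """Return a copy of `weights` with entries zeroed wherever `keep` is False."""
    return [[w if k else 0.0 for w, k in zip(w_row, k_row)]
            for w_row, k_row in zip(weights, keep)]


rng = random.Random(1)  # arbitrary seed for reproducibility
size = 4
limit = math.sqrt(6.0 / (size + size))  # Glorot uniform bound for a 4x4 layer
weights = [[rng.uniform(-limit, limit) for _ in range(size)]
           for _ in range(size)]

# Hypothetical mark: zero the central 2x2 block of the weight matrix.
keep = [[True, True, True, True],
        [True, False, False, True],
        [True, False, False, True],
        [True, True, True, True]]
marked = apply_mask(weights, keep)
```

Plotting `marked` as an image before training, and the corresponding trained matrix afterwards, gives the kind of visual comparison shown in the subplots of Fig. 1.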

Evolution of deviations of weights and trainability
One may understand the compatibility between the success of training and the fine-tuning process observed in Sec. 4, during which a large fraction of the weights of a network suffer very tiny updates (and many are not even changed at all), in the following way. We suggest that the neural networks typically trained are so overparameterized that, when initialized at random, their initial configuration has a high probability of being close to a proper minimum (i.e., a global minimum where the training loss approaches zero). Hence, to reach such a minimum, the network needs to adjust its weights only slightly, which causes its final configuration of weights to have strong traces of the initial configuration (in agreement with our observations).
This hypothesis raises the question of what happens when we train networks that have a small number of parameters. At some point, do they simply start to train worse? Or do they stop training at all? It turns out that, during the course of their training, neural networks do cross over between two regimes: trainability and untrainability. The trainability region may be further split into two distinct regimes: a regime of high trainability where training drives the networks towards global minima (with zero training loss), and a regime of low trainability where the networks converge to sub-optimal minima of significantly higher loss. Only high trainability allows a network to train successfully (i.e., to be trainable), since in the remaining two regimes, of untrainability and low trainability, either the networks do not learn at all, or they learn but very poorly. Figure 2 illustrates these three regimes.
The phase diagram in this figure (the bottom left panel) was obtained by analysing the temporal evolution of the train and test loss functions. It demonstrates the existence of three different classes of networks: weak, strong, and unstable learners. Weak learners are small networks that, over the course of their training, do not tend to diverge, but only train to poor minima. They mostly operate in a regime of low trainability, since they can be trained, but only to ill-suited solutions. At the other end of the trainability spectrum are large networks. These are strong learners, as they train to very small loss values and they are not prone to diverge (they operate mostly in the regime of high trainability). In between these two classes are unstable learners, which are midsize networks that train to progressively better minima (as their size increases), but that over the course of their training are likely to diverge and become untrainable (i.e., they tend to show a crossover from the regime of trainability to the one of untrainability).
Remarkably, we observe the different regimes of operation of a network not only from the behavior of its loss function, but also from how far it travels from its initial configuration of weights. We have already demonstrated qualitatively in Fig. 1 how the mark of the initial configuration of weights of a network persists in large networks (i.e., strong learners that over the course of their training were always in the regime of high trainability), and vanishes for midsize networks that ultimately cross over to the regime of untrainability. In Appendix A we supply a detailed description of the evolution of the statistics of weights during the training of the networks used to draw Fig. 2. Figures A.1 and A.2 show that, as the network width is reduced, the highly structured correlation between initial and final weight, illustrated by the coincidence of the median with the line w_f = w_i (see Fig. 3), remains in effect in all layers of weights until the trainability threshold. Below that point the structure of the correlations eventually breaks down, given enough training time.
To quantitatively describe how distant a network gets from its initial configuration of weights we consider the root mean square deviation (RMSD) of its system of weights at time t with respect to its initial configuration, i.e.,

$$ \mathrm{RMSD}(t) = \sqrt{\frac{1}{m} \sum_{j=1}^{m} \left[ w_j(t) - w_j(0) \right]^2}, \qquad (3) $$

where m is the number of weights of the network (which depends on its width), and w_j(t) is the weight of edge j at time t. Figure 5 plots, for three different datasets, the evolution of the loss function of networks of various widths alongside the deviation of their configuration of weights from its initial state. These plots evidence the existence of a link between the distance a network travels away from its initialization and the regime in which it is operating, which we describe below.
For all the datasets considered, the blue circles (•) show the training of networks that are weak learners; hence, they only achieve very high losses and are continuously operating in a regime of low trainability. These networks experience very large deviations in their configuration of weights, getting further and further away from their initial state. In contrast, the red left triangles (◀) show the training of large networks that are strong learners (in fact, for MNIST all the networks marked with triangles are strong learners; in our experiments we could not identify unstable learners on this dataset). These networks are always operating in the regime of high trainability, and over the course of their training they deviate very slightly from their initial configuration [compare with the results of Li & Liang (2018)]. Finally, for the Fashion MNIST and HASYv2 datasets, orange down (▼) and green up (▲) triangles show unstable networks of different widths (the former being smaller than the latter). While in the regime of trainability, these networks deviate much further from their initial configuration than strong learners (but less than weak learners).
However, as they diverge and cross over into the untrainability regime (which could only be observed on networks training with the HASYv2 dataset), the RMSD suffers a sharp increase and reaches a plateau. These observations highlight the persistent coupling between the network's trainability (measured as train loss) and the distance it travels away from the initial configuration (measured as RMSD), as well as their dependence on the network's width.
To complete the description of the behavior of these networks in the different regimes, Fig. 6 plots, for networks of different widths, the time at which they reach a loss below a certain value θ, and the RMSD between their configuration of weights at that time and the initial one. It shows that networks that are small and are operating under the low trainability regime fail to reach even moderate losses (e.g., on Fashion MNIST, no network of width 10 reaches a loss of 0.1, whereas networks of width 100 reach losses that are three orders of magnitude smaller). Moreover, even when they reach these loss values, they take significantly longer to do so, as the plots for MNIST demonstrate. Finally, the figure also shows that, as the networks grow in size, the displacement each weight has to undergo to allow the network to reach a particular loss decreases, meaning that the networks are progressively converging to minima that are closer to their initialization. We can treat this displacement as a measure of the work the optimization algorithm performs during the training of a network to make it reach a particular quality (i.e., value of loss). Then one can say that using larger networks eases training by decreasing the average work the optimizer has to spend on each weight.
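The RMSD used above can be computed directly from two snapshots of the weights. The following is a minimal sketch that assumes the weights of all layers have been flattened into one list per snapshot; the function name `rmsd` and the numerical values are hypothetical.

```python
import math


def rmsd(weights_t, weights_0):
    """Root mean square deviation between weights at time t and at time 0."""
    m = len(weights_0)
    squared = sum((wt - w0) ** 2 for wt, w0 in zip(weights_t, weights_0))
    return math.sqrt(squared / m)


# Hypothetical flattened weight snapshots (not data from the experiments).
w_initial = [0.10, -0.20, 0.05, 0.00]
w_trained = [0.12, -0.25, 0.05, 0.01]

deviation = rmsd(w_trained, w_initial)
```

Evaluating `rmsd` once per epoch against the stored initial snapshot yields a curve that can be compared with the loss curve, in the spirit of Fig. 5.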
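The two quantities discussed for Fig. 6 can be extracted from per-epoch histories as in the sketch below. It assumes the train loss and the RMSD were recorded once per epoch; the helper name `first_time_below` and all numbers are hypothetical.

```python
def first_time_below(losses, theta):
    """Return the index of the first epoch whose loss is below theta, or None."""
    for epoch, loss in enumerate(losses):
        if loss < theta:
            return epoch
    return None


# Hypothetical per-epoch histories for one run (not data from the paper).
train_loss = [2.30, 0.90, 0.30, 0.08, 0.05]
rmsd_history = [0.000, 0.004, 0.007, 0.009, 0.010]

theta = 0.1
epoch = first_time_below(train_loss, theta)
rmsd_at_threshold = None if epoch is None else rmsd_history[epoch]

# A threshold that this run never reaches, as happens for narrow networks.
never = first_time_below(train_loss, 0.01)
```

Returning `None` when the threshold is never reached keeps the case of weak learners, which fail to reach moderate losses, distinct from a very long but finite time.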

Conclusions
In this paper we explored the effects of the initial configuration of weights on the training and function of neural networks. We performed the statistical characterization of the deviation of the weights of two-hidden-layer networks of various sizes trained via Stochastic Gradient Descent from their initial random configuration. We observed that the initial configuration of weights typically leaves recognizable traces on the final configuration after training, which confirms that the learning process is based on fine-tuning the weights of the network. We investigated the conditions under which a network travels far from its initial configurations. We observed that a neural network learns in one of two major regimes: trainability and untrainability. Moreover, its size (number of parameters) largely determines the quality of its training process and its chance to enter the untrainability regime.
We compared the evolution of the distribution function of this deviation with the evolution of the loss during training and observed that over the course of training a network travels considerably far away from its initial configuration of weights only if it is either (i) a poor learner (which means that it never reaches a good minimum) or (ii) when it crosses over from trainability to untrainability regimes. In the alternative (good) case, in which the network is a strong learner and it does not become untrainable, the network always converges to the neighborhood of its initial configuration (keeping extensive traces of its initialization); in all of our experiments, we never observed a network converging to a good minimum outside the vicinity of the initial configuration.
Note that most of our results are for times when overfitting is already taking place. At shorter times, the deviations of weights from their initial values are even smaller, and our conclusions remain valid.
We based our conclusions on the analysis of the loss function of the networks we trained, since this was the actual function that the networks were optimizing. However, equivalent results can be obtained by using the accuracy or another similar metric.
We intend to verify the generality of our results in neural networks of different architectures and with datasets representative of more challenging tasks. We also intend to find out whether our results hold while varying the depth of the networks.

[...] represent the standard deviation measured in ten independent realizations.) To emphasize the coupling between trainability and proximity to the initial configuration of weights, we used data from the same ten realizations to plot Fig. 2 and Figs. A.1 and A.2. These figures combined show the simultaneousness of the abrupt increase of the loss and of the deviation from the initial configuration. For the sake of completeness, we also perform linear fittings to the mean trained weight as a function of the initial weight. Figure A.2 shows the results of these fittings: a_0 and a_1 denote the constant and the slope, respectively. Similarly to c_opt, while above a width of about 300 nodes the values of a_0 and a_1 are stable, below the threshold they suffer an abrupt change at some moment of training. The dispersion of the trained weights around their initial value, measured by the standard deviation, is also shown in Fig. A.2 for the set of weights that are initialized with the value w_i = 0, displaying the same transition at a width of about 300. We observed that the distribution of trained weights for other values w_i ≠ 0 behaves similarly with the variation of the network's width. Notice that, since the weights are initialized from a continuous distribution, we are able to measure the mode (peaks), average, and standard deviation of weights as functions of the initial value w_i by applying the special procedure described in Appendix B, which is less affected by the presence of noise (fluctuations) in the data than the standard binning methods.

Appendix B. Fitting the statistics of weights in a single realization
This appendix briefly describes the methods used in this work to characterize the statistics of the displacements of weights produced by training with SGD. The problem is that we cannot directly obtain the distributions $P_{w_i}(w_f)$ of the final weights $w_f$ for each value of the initial weight $w_i$, because the $w_i$'s are drawn from a continuous (uniform) distribution, see Eq. (1). In practice, for a single realization of training, we have a set of points $(w_i, w_f)$, one for each link, in the continuous plane, as shown in Fig. 3. In this situation, the mean $\overline{w_f}(w_i)$ and the standard deviation $\sigma_{w_f}(w_i)$ of the distribution $P_{w_i}(w_f)$ can be calculated in one of two ways: with a binning procedure or with cumulative distributions. In our analysis, we employed the latter, which is less affected by random fluctuations than the standard binning methods.
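For comparison, the binning alternative mentioned above can be sketched as follows. This is only an illustrative baseline and not the procedure used in this work: the function name `binned_stats`, the use of equal-width bins, and the default of 20 bins are all assumptions made for the example.

```python
import numpy as np


def binned_stats(w_init, w_final, n_bins=20):
    """Mean and standard deviation of w_final inside equal-width bins of w_init.

    Hypothetical helper: equal-width bins and n_bins=20 are arbitrary choices.
    Returns (bin centers, per-bin mean, per-bin standard deviation).
    """
    edges = np.linspace(w_init.min(), w_init.max(), n_bins + 1)
    # digitize returns 1-based bin indices; the maximum value falls past the
    # last edge, so clip it back into the last bin.
    idx = np.clip(np.digitize(w_init, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])

    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=w_final, minlength=n_bins)
    sq_sums = np.bincount(idx, weights=w_final**2, minlength=n_bins)

    safe = np.maximum(counts, 1)  # avoid division by zero in empty bins
    mean = np.where(counts > 0, sums / safe, np.nan)
    var = np.where(counts > 0, sq_sums / safe - mean**2, np.nan)
    return centers, mean, np.sqrt(np.clip(var, 0.0, None))
```

The estimates depend on the bin width: narrow bins contain few points and fluctuate strongly, while wide bins mix the spread of $w_i$ inside a bin into the measured standard deviation.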
We assume a linear fit $\overline{w_f}(w_i) = a_0 + a_1 w_i$, and obtain the constants $a_1$ and $a_0$ as follows. Let us define the function

$$W_f(w) = \int_{\min(w_i)}^{w} \overline{w_f}(x)\,dx = C + a_0 w + \frac{a_1}{2} w^2, \tag{B.1}$$

where $C$ is a constant resulting from the lower limit of the integral. For one given realization, we can estimate this function from the following cumulative sum

$$W_f(w) \approx \frac{\max(w_i) - \min(w_i)}{N} \sum_{j:\, w_j(0) \le w} w_j(t_f), \tag{B.2}$$

where $w_j(t_f)$ denotes the value of the weight of link $j$ at time $t_f$, $N$ is the number of links of the network, and $\min(w_i)$/$\max(w_i)$ is the minimum/maximum value of the initialization weights. [The sum on the right-hand side of Eq. (B.2) runs over all links whose initial weight, $w_j(0)$, is not larger than $w$.] Finally, we fit a second-degree polynomial to $W_f(w)$, and get the constants $a_1$ and $a_0$ from its coefficients.

We use the same 'cumulative-based' approach to find the second moment of $P_{w_i}(w_f)$, denoted by $\overline{w_f^2}(w_i)$. In this case, we assume the polynomial $\overline{w_f^2}(w_i) = b_0 + b_1 w_i + b_2 w_i^2$. We again define the corresponding cumulative function and fit a third-degree polynomial to get the coefficients $b_0$, $b_1$, and $b_2$. Then, we calculate $\sigma_{w_f}(w_i)$ as

$$\sigma_{w_f}(w_i) = \sqrt{\overline{w_f^2}(w_i) - \left[\overline{w_f}(w_i)\right]^2}.$$

The method for fitting the peak (or the mode) of the distribution $P_{w_i}(w_f)$ is also based on a cumulative distribution. In our experiments we observe that, in the trainability regime, the peak of the distribution of $w_f$ as a function of $w_i$ is indistinguishable from a straight line, see Fig. 3. Accordingly, we define

$$N_c(b) = \sum_{j=1}^{N} \Theta\big(b + c\, w_j(0) - w_j(t_f)\big),$$

where $\Theta$ is the Heaviside step function with $\Theta(0) = 1$. This function counts the number of points $(w_i, w_f)$ below or at the line $b + c w_i$. Then, we fit the peak of $P_{w_i}(w_f)$ by optimizing the expression

$$c_{\mathrm{opt}} = \arg\max_{c} \left[ \max_{b} \frac{dN_c(b)}{db} \right].$$

In other words, we look for the slope that causes the largest rate of change in the function $N_c(b)$. This slope, $c_{\mathrm{opt}}$, is the slope of the linear function that best aligns with the peak of the distribution $P_{w_i}(w_f)$.
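The cumulative-based fits described in this appendix can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code used to produce the results: the function names (`cumulative_fit`, `fitted_std`, `fit_peak_slope`) are hypothetical, the slope of the peak is searched over a user-supplied grid, and the derivative of $N_c(b)$ is approximated by a finite difference over a window of width `window`, which is a free parameter of the sketch.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def cumulative_fit(w_init, values, degree):
    """Fit a polynomial of the given degree to the cumulative sum of `values`
    taken in order of increasing initial weight, and return the coefficients
    of its derivative (lowest order first).

    degree=2 with values=w_final gives (a0, a1);
    degree=3 with values=w_final**2 gives (b0, b1, b2).
    """
    order = np.argsort(w_init)
    w_sorted = w_init[order]
    # Prefactor (max(w_i) - min(w_i)) / N of the cumulative sum.
    scale = (w_sorted[-1] - w_sorted[0]) / w_init.size
    cumulative = scale * np.cumsum(values[order])
    coeffs = P.polyfit(w_sorted, cumulative, degree)
    return P.polyder(coeffs)


def fitted_std(w, mean_coeffs, second_coeffs):
    """Standard deviation from the fitted first and second moments at w."""
    mean = P.polyval(w, mean_coeffs)
    second = P.polyval(w, second_coeffs)
    # Clip small negative values caused by fitting noise before the root.
    return np.sqrt(np.clip(second - mean**2, 0.0, None))


def fit_peak_slope(w_init, w_final, slopes, window):
    """Grid search for the slope c that maximizes the steepest rise of N_c(b).

    Assumption of this sketch: dN_c/db is approximated by
    [N_c(b + window) - N_c(b)] / window, evaluated with b at each data point.
    """
    best_slope, best_rate = None, -np.inf
    for c in slopes:
        # N_c(b) is the number of residuals w_f - c * w_i not larger than b.
        residual = np.sort(w_final - c * w_init)
        upper = np.searchsorted(residual, residual + window, side="right")
        rate = (upper - np.arange(residual.size)).max() / window
        if rate > best_rate:
            best_slope, best_rate = float(c), rate
    return best_slope
```

With arrays `w_i` and `w_f` holding the initial and final weights of all links, `cumulative_fit(w_i, w_f, 2)` returns estimates of $(a_0, a_1)$ and `cumulative_fit(w_i, w_f**2, 3)` returns estimates of $(b_0, b_1, b_2)$. In the peak search, a small window resolves the peak more sharply but leaves fewer points per window, so the estimate of the slope becomes noisier.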