Estimating Neural Network’s Performance with Bootstrap: A Tutorial

Abstract: Neural networks present characteristics where the results are strongly dependent on the training data, the weight initialisation, and the hyperparameters chosen. The determination of the distribution of a statistical estimator, such as the Mean Squared Error (MSE) or the accuracy, is fundamental to evaluate the performance of a neural network model (NNM). For many machine learning models, such as linear regression, it is possible to analytically obtain information such as the variance or confidence intervals on the results. Neural networks present the difficulty of not being analytically tractable due to their complexity. Therefore, it is impossible to easily estimate the distributions of statistical estimators. When estimating the global performance of an NNM by estimating the MSE in a regression problem, for example, it is important to know the variance of the MSE. Bootstrap is one of the most important resampling techniques to estimate averages and variances, among other properties, of statistical estimators. In this tutorial, the application of resampling techniques (including bootstrap) to the evaluation of neural networks' performance is explained from both a theoretical and practical point of view. The pseudo-code of the algorithms is provided to facilitate their implementation. Computational aspects, such as the training time, are discussed, since resampling techniques always require simulations to be run many thousands of times and are, therefore, computationally intensive. A specific version of the bootstrap algorithm is presented that allows the estimation of the distribution of a statistical estimator when dealing with an NNM in a computationally effective way. Finally, the algorithms are compared on both synthetically generated and real data to demonstrate their performance.


Introduction
An essential step in the design of a neural network model (NNM) is the definition of the neural network architecture [1]. In this tutorial, the analysis assumes that the network architecture design phase is completed and that the architecture parameters are not varied anymore. It is assumed that a dataset S is available. For the training and testing of an NNM, S is split into two parts, called the training dataset (S_T) and the validation dataset (S_V), with S = S_T ∪ S_V and S_T ∩ S_V = ∅. The model is then trained on S_T. Afterwards, a given statistical estimator, θ, is evaluated on both S_T and S_V (indicated with θ(S_T) and θ(S_V)) to check for overfitting [1]. θ can be, for example, the accuracy in a classification problem, or the Mean Squared Error (MSE) in a regression one. θ is clearly dependent (sometimes strongly) on both S_T and S_V, as the NNM was trained on S_T. The metric evaluated on the validation dataset S_V may be indicated as θ_{S_T,S_V}(S_V) to make its dependence on the two datasets S_T and S_V transparent. The difficulty in evaluating the distribution of θ_{S_T,S_V}(S_V) is that it is enough to split the data differently (or, in other words, to get different S_T and S_V) for the metric θ_{S_T,S_V}(S_V) to assume a different value. This poses the question of how to evaluate the performance of an NNM. One of the most important characteristics of an NNM is its ability to generalise to unseen data, or to maintain its performance when applied to any new dataset. If a model can predict a quantity with an accuracy of, for example, 80%, the accuracy should remain around 80% when the model is applied to new and unseen data. However, changing the training data will change the performance of any given NNM. To illustrate why a single number measuring the performance of an NNM can be misleading, let us consider the following example. Suppose we are dealing with a dataset where one of the features is the age.
What would happen if, by sheer bad luck, one splits the dataset into one part (S_T) with only young people and one (S_V) with only old people? The trained NNM will, of course, not be able to generalise well to age groups that are different from those present in S_T. Therefore, the model performance, measured by the statistical estimator θ, will drop significantly. This problem can only be identified by considering multiple splits and by studying the distribution of θ.
The only possibility to estimate the variance of the performance of the model given a dataset S is to split S in many different ways (and, therefore, obtain many different training and validation datasets). Then, the NNM has to be trained on each training dataset. Finally, the chosen statistical estimator θ can be evaluated on the respective validation datasets. This allows one to calculate the average and variance of θ, and to use these values as an estimate of the performance of the NNM when applied to different datasets. This technique will be called the "split/train algorithm" in this tutorial. The major disadvantage of this technique is that it requires repeating the training of the NNM for every split and is therefore very time-consuming. If the model is large, it will require an enormous amount of computing time, making it not always a practical approach.
This paper shows how this difficulty can be overcome using resampling techniques to give an estimate of the average and the variance of metrics such as the MSE or the accuracy, thus avoiding the training of hundreds or thousands of models. An alternative to resampling techniques is given by so-called ensemble methods, namely, algorithms that train a set of models and then generate a prediction by combining the single predictions. The interested reader is referred to [2][3][4][5][6][7][8][9].
The goal of this tutorial is to present the main resampling methods, and discuss their applications and limitations when used with NNMs. With the information contained in this tutorial, a reader with some experience in programming should be able to implement them. This tutorial is not meant to be an exhaustive review of the mentioned algorithms. The interested reader is referred to the extensive list of references given in each section for a discussion of the theoretical limitations.
The main contributions of this tutorial are four. Firstly, it highlights the role of the Central Limit Theorem (CLT) in describing the distribution of averaging statistical estimators, such as the MSE, in the context of NNMs. In particular, it is shown how the distribution of, for example, the MSE tends to a normal distribution for increasing sample size, thus justifying the use of the average and the standard deviation to describe it. Secondly, it provides a short review of the main resampling techniques (hold-out set approach, leave-one-out cross-validation, k-fold cross-validation, jackknife, subsampling, split/train, bootstrap) with an emphasis on the challenges that arise when using neural networks. For most of the above-mentioned techniques, the steps are described with the help of pseudo-code to facilitate the implementation. Thirdly, bootstrap, split/train, and the mixed approach between bootstrap and split/train are discussed in more depth, again with the help of pseudo-code, including the application to synthetic and real datasets. Finally, limitations and promising future research directions in this area are briefly discussed. Details of implementations of resampling techniques on problems of high complexity go beyond the scope of this work, and the reader is referred to the following examples [10][11][12].
This tutorial is structured in the following way. In Section 2 the notation is explained, followed by a discussion of the CLT and its relevance for NNMs in Section 3. A short introduction to the idea behind bootstrap is presented in Section 4, while other resampling algorithms are discussed in Section 5. In Section 6, bootstrap, split/train, and the mixed approach between bootstrap and split/train are explained in more detail, and compared. Practical examples with both synthetic and real data are described in Sections 7 and 8, respectively. Finally, an outlook and the conclusions are presented in Sections 9 and 10, respectively.

Notation
A dataset of n independent, identically distributed (iid) observations will be indicated here with X_n ≡ (x_1, . . . , x_n). This dataset comes from a population described by a probability density function (PDF) F that is generally unknown. Let us assume that the statistical estimator θ (for example, the average or the mean squared error) is a functional. Loosely speaking, θ is a mapping from the space of possible PDFs into the real numbers R. To make the concept clear, let us suppose that the estimator is the mean of the observations x_i. In this case, θ(F) = ∫ x dF(x), where it is clear that the right-hand side is a real number. Unfortunately, in all practical cases, the "real" PDF F is unknown. Given a certain dataset X_n, the only possibility is to approximate the estimator θ with θ̂_n ≡ θ(F_n), where F_n indicates the empirical distribution obtained from X_n by giving a probability of 1/n to each observation x_i. This is the idea at the basis of the bootstrap algorithm, as will be discussed in detail in Section 4.

Central Limit Theorem for an Averaging Estimator θ
A large body of mathematics has been developed to obtain at least the asymptotic distribution of θ̂_n for n → ∞ [13][14][15][16]. The CLT [17], also known as the Lindeberg-Lévy central limit theorem, states that, for a sequence of iid observations x_i with expected value µ = E[x_i] and finite variance σ², the distribution of the standardised average √n(X̄_n − µ)/σ, with X̄_n ≡ (1/n) Σ_{i=1}^n x_i, tends to the standard normal distribution as n → ∞. For any practical purpose, if n is large enough, the normal distribution N(µ, σ²/n) will give a good approximation of the distribution of the average of a dataset of n iid observations x_i.
Let's consider F to be a chi-squared distribution (notoriously asymmetric) with k = 10 [18], normalized to have the average equal to zero (panel (a) in Figure 1). Let's now calculate the average of X_n, as in Equation (5), 10^6 times in three cases: n = 2, 10, 200. The results are shown in Figure 1, panels (b) to (d). When the sample size is small (n = 2, panel (b)), the distribution is clearly not symmetrical, but when the sample size grows (panels (c) and (d)), the distribution approximates the normal distribution. Figure 1 is a numerical demonstration of the CLT.
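This numerical demonstration can be reproduced with a short, self-contained Python sketch. The number of repetitions is reduced here from 10^6 to keep the run time short; the seed and repetition count are illustrative choices, not the values used for Figure 1.

```python
import math
import random
import statistics

random.seed(42)

def chi2_centered(k=10):
    """One draw from a chi-squared(k) distribution, shifted to have zero mean."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k)) - k

def sample_means(n, n_rep=2000):
    """Distribution of the average of n iid centered chi-squared draws."""
    return [statistics.fmean(chi2_centered() for _ in range(n)) for _ in range(n_rep)]

# For chi-squared(10) the variance is 2k = 20, so the CLT predicts that the
# average of n draws is approximately N(0, 20/n) for large n.
for n in (2, 10, 200):
    means = sample_means(n)
    print(n, round(statistics.fmean(means), 3),
          round(statistics.stdev(means), 3), round(math.sqrt(20 / n), 3))
```

For growing n, the empirical standard deviation of the averages approaches the CLT prediction √(20/n), while the skewness visible for n = 2 disappears.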
Typically, when dealing with neural networks, both in regression and classification problems, one has to deal with complicated functions like the MSE, the cross-entropy, the accuracy, or other metrics [1]. Therefore, it may seem that the central limit theorem does not play a role in any practical application involving NNMs. This, however, is not true. Consider, as an example, the MSE of a given dataset of input observations x_i with average µ, MSE = (1/n) Σ_{i=1}^n (x_i − µ)². It is immediately evident that this expression (Equation (6)) is nothing other than the average of the transformed inputs (x_i − µ)². Note that the CLT does not make any assumption on the distribution of the observations. Thus, the CLT is also valid for the average of observations that have been transformed (as long as the average and variance of the transformed observations remain finite). In other words, it is valid for both x_i and (x_i − µ)². This can be formalized in the following Corollaries.
Corollary 1. Given a dataset of iid observations x_1, . . . , x_n with finite mean µ and variance σ², define the quantities δ_i ≡ (x_i − µ)². The limiting form of the distribution of the MSE for n → ∞ will be the normal distribution N(µ(δ_i), σ(δ_i)²/n) (Equation (8)), where with µ(δ_i) and σ(δ_i), we have indicated the expected value and the standard deviation of the δ_i, respectively.
Proof. The first thing to note is that since the x_i have finite mean and finite variance, it follows that δ_i ≡ (x_i − µ)² will also have finite mean and finite variance, and therefore the CLT can be applied to the average of the δ_i. By applying the CLT to the quantities δ_i, Equation (8) is obtained. That concludes the proof.
A numerical demonstration of this result can be clearly seen in Section 7.1. In particular, Figure 2 shows that the distribution of the MSEs approximates N when the sample size is large enough.
Note that Corollary 1 can easily be generalized to any estimator of the form θ = (1/n) Σ_{i=1}^n g(x_i), provided the quantities g_i ≡ g(x_i) have a finite mean µ(g_i) and variance σ(g_i)². For completeness, Corollary 1 can be written in this general form. Corollary 2. Given a dataset of iid observations x_1, . . . , x_n with finite mean µ and variance σ², we define the quantities g_i ≡ g(x_i). It is assumed that the average µ(g_i) and the variance σ(g_i)² are finite. The limiting form of the distribution of the estimator θ for n → ∞ will be the normal distribution N(µ(g_i), σ(g_i)²/n). Proof. The proof is trivial, as it is simply necessary to apply the central limit theorem to the quantities g_i, since θ is nothing other than the average of those quantities.
The previous corollaries play a major role for neural networks. The implications of the final distributions of averaging metrics being Gaussian are that:

• The distribution is symmetric around the average, with the same number of observations below and above it; and
• The standard deviation of the distribution can be used as a statistical error, knowing that ca. 68% of the results will be in a region of ±σ around the average.

These results justify the use of the average of a statistical estimator, such as the MSE, and of its standard deviation as the only parameters needed to describe the network performance.

Bootstrap
Bootstrap is essentially a resampling algorithm. It was introduced by Efron in 1979 [19], and it offers a simulation-based approach for estimating, for example, the variance of statistical estimates of a random variable. It can be used in both parametric and nonparametric settings. The main idea is quite simple and consists of creating new datasets from an existing one by resampling with replacement. The discussion here is limited to the case of n observations that are iid (see Section 2 for more details). The estimator θ calculated on the simulated datasets will then approximate the estimator evaluated on the true population, which is unknown.
The best way to understand the bootstrap technique is to describe it in pseudo-code, as this illustrates its steps. The original bootstrap algorithm proposed by Efron [19] can be seen in pseudo-code in Algorithm 1. In the Algorithm, N_s is an integer and indicates the number of new samples generated with the bootstrap algorithm.
Algorithm 1: Pseudo-code of the bootstrap algorithm.
Result: Average and standard deviation of a statistical estimator θ̂
1 for i = 1, . . . , N_s do
2   Generate a new sample X*_{n,i} by selecting n elements from X_n with replacement;
3   Evaluate the statistical estimator θ̂_i on X*_{n,i};
4 end
5 Evaluate the average and standard deviation of the θ̂_i;

As a consequence of Corollary 2 (Section 3), Algorithm 1, for a large enough N_s, gives an approximation of the average and variance of the statistical estimator θ. Since the distribution is Gaussian (albeit only for N_s → ∞), these two parameters describe it completely.
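Algorithm 1 can be sketched in a few lines of Python. The dataset, seed, and number of bootstrap samples below are illustrative; the mean is used as θ so that the bootstrap standard deviation can be checked against the classical standard error s/√n.

```python
import random
import statistics

random.seed(0)

def bootstrap(X_n, theta, n_samples=5000):
    """Algorithm 1: draw n_samples bootstrap samples (with replacement) from
    X_n and evaluate the estimator theta on each of them."""
    n = len(X_n)
    return [theta([random.choice(X_n) for _ in range(n)]) for _ in range(n_samples)]

# Example with the mean as estimator: the standard deviation of the bootstrap
# replicates should approximate the classical standard error s / sqrt(n).
X_n = [random.gauss(5.0, 2.0) for _ in range(200)]
theta_star = bootstrap(X_n, statistics.fmean)
print(round(statistics.fmean(theta_star), 3), round(statistics.stdev(theta_star), 3))
```

For a general estimator, `theta` can be any function of a sample, e.g. an MSE-style average of transformed observations.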
An important question is how big N s should be. In the original article, Efron [42] suggests that N s of the order of 100 is already enough to get reasonable estimates. Chernick [26] considers N s ≈ 500 already very large and more than enough. In the latter work, however, it is indicated that, above a certain value of N s , the error is due to the approximation of the true distribution F by the empirical distribution F n , rather than by the low number of samples. Therefore, particularly given the computational power available today, it is not meaningful to argue whether 100 or 500 is enough, as running the algorithm with many thousands of samples will take only a few minutes on most modern computers. In many practical applications, using N s between 5000 and 10,000 is commonplace [26]. As a general rule of thumb, N s is large enough when the distribution of the estimator starts to resemble a normal distribution.
The method described here is very advantageous when using neural networks because it allows an estimation of the average and variance of quantities such as the MSE or the accuracy without training the model hundreds or thousands of times. Thus, it is extremely attractive from a computational point of view and is a very pragmatic solution to a potentially very time-consuming problem. The specific details and pseudo-code of how to apply this method to NNMs will be discussed at length in Section 6.

Other Resampling Techniques
For completeness, in this section, additional techniques, namely the hold-out set approach, leave-one-out cross-validation, k-fold cross-validation, jackknife, and subsampling, are briefly discussed, including their limitations. For an in-depth analysis, the interested reader is referred to [43] and to the given literature.

Hold-Out Set Approach
The simplest approach to estimating a statistical estimator is to randomly divide the dataset into two parts: a training dataset S_T and a validation dataset S_V. The validation dataset is sometimes called a hold-out set (from which the name of this technique derives). The model is trained on the training dataset S_T and then used to evaluate θ on the validation dataset S_V. θ(S_V) is used as an estimate of the expected value of θ. This approach is also used to identify whether the model overfits or, in other words, unknowingly learns to extract some of the noise in the data as if it represented an underlying model structure [1]. The presence of overfitting is checked by comparing θ(S_T) and θ(S_V): a large difference is an indication that overfitting is present. The interested reader can find a detailed discussion in [1]. Such an approach is widely used, but has two major drawbacks [43]. Firstly, since the split is done only once, it can happen that S_V is not representative of the entire dataset, as described in the age example in Section 1. Using this approach would not allow for the identification of such a problem, therefore giving the impression that the model has very bad performance. In other words, this method is highly dependent on the dataset split. The second drawback is that splitting the dataset reduces the number of observations available in the training dataset, therefore making the training of the model less effective.
The techniques explained in the following sections try to address these two drawbacks with different strategies.

Leave-One-Out Cross-Validation
Leave-one-out cross-validation (LOOCV) can be well understood with the pseudo-code outlined in Algorithm 2. This approach has the clear advantage that the model is trained on almost all observations. Therefore, it addresses the second drawback of the hold-out approach, namely that fewer observations are available for training. This also has the consequence that LOOCV tends not to overestimate the statistical estimator θ [43] as much as the hold-out approach. The second major advantage is that this approach will not present the problem that was described in the age example in the introduction, as the training dataset will include almost all observations, and it will vary n times.
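A minimal LOOCV sketch is given below. Since actually training an NNM n times is exactly the cost problem discussed next, a trivial stand-in "model" (predicting the mean of the training targets) replaces the training step; `fit_mean` and the toy data are illustrative assumptions.

```python
import statistics

def loocv(X, y, fit, metric):
    """LOOCV sketch: for each i, train on all observations except the i-th,
    then evaluate the metric on the single held-out observation."""
    scores = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        scores.append(metric(model(X[i]), y[i]))
    return statistics.fmean(scores)

# Stand-in "model" (a placeholder for the expensive NNM training step):
# it ignores the input and always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    m = statistics.fmean(y_train)
    return lambda x: m

squared_error = lambda pred, true: (pred - true) ** 2

X = list(range(10))
y = [2.0 * x for x in X]
mse_loocv = loocv(X, y, fit_mean, squared_error)
print(mse_loocv)
```

Replacing `fit_mean` with a real NNM training routine gives LOOCV proper, with the corresponding n-fold training cost.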
The approach, however, has one major drawback: since the model must be trained n times, it is computationally very expensive whenever the NNM training is resource-intensive. For medium to large NNMs, this approach is simply not a realistic possibility.

k-Fold Cross-Validation
k-fold cross-validation (k-fold CV) is similar to LOOCV but tries to address the drawback that the model has to be trained n times. The method involves randomly dividing the dataset into k groups (also called folds) of approximately equal size. The method is outlined in pseudo-code in Algorithm 3.

Algorithm 3: k-fold cross-validation (k-fold CV) algorithm.
Result: Estimate of a statistical estimator θ̂
1 Define an NNM by fixing the hyperparameters;
2 Split the dataset into k groups (folds): S^(i), with i = 1, . . . , k;
3 for i = 1, . . . , k do
4   Train the NNM on the dataset ∪_{j=1, j≠i}^k S^(j), or, in other words, the dataset without the i-th fold;
5   Evaluate the statistical estimator θ̂^(i) on S^(i);
6 end

Therefore, LOOCV is simply a special case of k-fold CV for k = n. Typical values of k are 5 to 10. The main advantage of this method with respect to LOOCV is clearly computational: the model has to be trained only k times instead of n.
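The steps of Algorithm 3 can be sketched as follows, with a cheap stand-in (a mean predictor, called `fit_mean` here) replacing the NNM training step; the toy data are illustrative.

```python
import random
import statistics

def k_fold_cv(X, y, fit, metric, k=5, seed=0):
    """Algorithm 3 sketch: k estimates of the metric, one per held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k folds of roughly equal size
    estimates = []
    for fold in folds:
        held_out = set(fold)
        X_tr = [X[j] for j in idx if j not in held_out]
        y_tr = [y[j] for j in idx if j not in held_out]
        model = fit(X_tr, y_tr)                    # trained once per fold
        estimates.append(statistics.fmean(metric(model(X[j]), y[j]) for j in fold))
    return estimates

def fit_mean(X_tr, y_tr):                          # placeholder for NNM training
    m = statistics.fmean(y_tr)
    return lambda x: m

sq_err = lambda p, t: (p - t) ** 2
X = list(range(100))
y = [3.0 * x for x in X]
theta_folds = k_fold_cv(X, y, fit_mean, sq_err)
print([round(t, 1) for t in theta_folds])
```

Note that only k values of the estimator are produced, which is precisely the limitation for studying its distribution discussed below.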
When the dataset is not big, k-fold CV has the drawback that it reduces the number of observations available for training and for estimating θ, since each fold will be k times smaller than the original dataset. Additionally, it may be argued that using only a few values of the statistical estimator (for example, 5 or 10) to study its distribution is questionable [43].

Jackknife
The jackknife algorithm [44,45] is another resampling technique, first developed by Quenouille [44] in 1949. It consists of creating n samples by removing one observation at a time from the available x_1, . . . , x_n. For example, one jackknife sample will be x_2, . . . , x_n, with n − 1 elements. To estimate θ, the statistical estimator is evaluated on each sample (of size n − 1). Note that the jackknife may seem to be the exact same method as LOOCV, but there is one major difference that is important to highlight. While in LOOCV, θ̂ is evaluated on the i-th observation held out (in other words, on one single observation), in jackknife, θ̂ is evaluated on the remaining n − 1 observations. With this method, it is only possible to generate n samples that can then be used to evaluate an approximation of a statistical estimator. This is one significant limitation of the method compared to bootstrap: if the size of the dataset is small, only a limited number of samples will be available. The interested reader is referred to the reviews [28,29,[46][47][48][49][50]. A second major limitation is that the jackknife estimate of an averaging estimator θ coincides with the average and standard deviation of the observations [26]. Thus, using jackknife is not helpful to approximate θ(F).
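The jackknife sample generation, and the fact that the jackknife average of an averaging estimator reduces to the plain sample mean, can be verified with a few lines (the small dataset is illustrative).

```python
import statistics

def jackknife_samples(X_n):
    """The n jackknife samples, each of size n-1, obtained by deleting one observation."""
    return [X_n[:i] + X_n[i + 1:] for i in range(len(X_n))]

X_n = [1.0, 4.0, 2.0, 8.0, 5.0]
theta_jack = [statistics.fmean(s) for s in jackknife_samples(X_n)]

# For an averaging estimator, the average of the jackknife replicates coincides
# exactly with the plain sample mean, which is why the jackknife adds nothing
# for approximating theta(F) in this case.
print(statistics.fmean(theta_jack), statistics.fmean(X_n))
```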
For the limitations discussed above, this technique is not particularly advantageous when dealing with NNMs, and therefore, is seldom used in such a context, especially when compared with the bootstrap algorithm and its advantages.

Subsampling
Another resampling technique is subsampling, achieved by simply choosing from a dataset X_n with n elements m < n elements without replacement. As a result, the samples generated with this algorithm have a smaller size than the initial dataset X_n. Like the bootstrap algorithm, subsampling has been widely studied and used in many different fields, from genomics [51,52] to survey science [53,54], finance [55,56] and, of course, statistics [26,57,58]. The two reviews [26,59] can be consulted by the interested reader. Precise conditions under which subsampling leads to a good approximation of the desired estimator can be found in [59][60][61][62].
Subsampling presents a fundamental difficulty when dealing with the average as a statistical estimator. By its own nature, subsampling requires considering a sample of smaller size m than the available dataset (of size n). As seen previously, the CLT states that the average of a sample of size m tends asymptotically to a normal distribution with a standard deviation proportional to 1/√m. That means that changing the sample size changes the standard deviation of the distribution of θ. Note that this is not a reflection of the properties of the MSE, but only of the sample size chosen. In the extreme case where m = n (one could argue that this is not subsampling anymore, but let's consider it as a limiting case), the average estimator will always have the same value, exactly the average of the inputs, since the subsampling samples are drawn without replacement, and therefore the standard deviation will be zero. On the other hand, if m = 1, the standard deviation will increase significantly and will coincide with the standard deviation of the observations. Therefore, the subsampling method presents the fundamental problem of the choice of m. Since there is no general criterion to choose m, the distribution of θ will reflect the arbitrary choice of m and the properties of the data at the same time. This is why the authors do not think that the subsampling method is well-suited to give a reasonable and interpretable estimate of the distribution of a statistical estimator, such as the MSE.
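The dependence of the subsampling spread on the arbitrary choice of m can be demonstrated numerically; the dataset, seed, and values of m below are illustrative.

```python
import random
import statistics

random.seed(1)

def subsample_means(X_n, m, n_samples=4000):
    """Averages of subsamples of size m drawn without replacement from X_n."""
    return [statistics.fmean(random.sample(X_n, m)) for _ in range(n_samples)]

X_n = [random.gauss(0.0, 1.0) for _ in range(200)]

# The spread of the subsampling distribution is dictated by the choice of m,
# not only by the data: it shrinks roughly as 1/sqrt(m) and collapses to zero
# for m = n, since without replacement every sample is then the full dataset.
spreads = {m: statistics.stdev(subsample_means(X_n, m)) for m in (5, 50, 200)}
for m, s in spreads.items():
    print(m, round(s, 4))
```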

Algorithms for Performance Estimation
As discussed in Section 1, the performance of an NNM can be assessed by estimating the variance of the statistical estimator θ. The distribution of θ can be evaluated by the split/train algorithm, which splits a given dataset S randomly N_s times into two parts, S^(i)_T and S^(i)_V, with i = 1, . . . , N_s. This algorithm is described with the help of pseudo-code in Section 6.1. This approach is unfortunately extremely time-consuming, as the training of the NNM is repeated N_s times.
As an alternative, the approach based on bootstrap is discussed in Section 6.2. This algorithm has an advantage over the split/train algorithm in being very time-efficient, since the NNM is trained only once. After training, the distribution of a statistical estimator is then estimated by using a bootstrap approach on S V .

Split/Train Algorithm
The goal of the algorithm is to estimate the distribution of a statistical estimator, like the MSE or accuracy, when considering the many variations of splits, or in other words, the different possible S T and S V . To do this, for each split a new model is trained, so as to include the effect of the change of the training data.
The algorithm is described in pseudo-code in Algorithm 4. First, the dataset is randomly split N_s times into S^(i)_T and S^(i)_V ≡ S \ S^(i)_T, where S \ S^(i)_T indicates the dataset obtained by removing all x_i ∈ S^(i)_T from S. Then, the training is performed and, finally, the distribution of the statistical estimator is evaluated.
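A sketch of the split/train loop follows. The expensive NNM training step is replaced by a trivial mean predictor so the sketch runs in seconds; `fit_mean`, the toy dataset, and the split fraction are illustrative assumptions.

```python
import random
import statistics

def split_train(S, fit, theta, n_splits=100, frac_train=0.8, seed=0):
    """Split/train sketch: re-split the dataset n_splits times, re-train on
    each S_T, and evaluate the estimator on the corresponding S_V."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_splits):
        data = S[:]
        rng.shuffle(data)
        cut = int(frac_train * len(data))
        S_T, S_V = data[:cut], data[cut:]
        model = fit(S_T)          # the expensive step, repeated for every split
        estimates.append(theta(model, S_V))
    return statistics.fmean(estimates), statistics.stdev(estimates)

# Placeholder "training": predict the mean target seen in S_T.
fit_mean = lambda S_T: statistics.fmean(y for _, y in S_T)
mse = lambda model, S_V: statistics.fmean((model - y) ** 2 for _, y in S_V)

rng = random.Random(7)
S = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (rng.random() for _ in range(500))]
avg_mse, sd_mse = split_train(S, fit_mean, mse)
print(round(avg_mse, 3), round(sd_mse, 3))
```

With a real NNM in place of `fit_mean`, the loop body is exactly where the N_s training runs, and hence the prohibitive cost, would occur.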
It is important to note that Algorithm 4 can be quite time-consuming, since the NNM is trained N_s times. Thus, if the training of a single NNM takes a few hours, Algorithm 4 can easily take days and therefore may not be of any practical use. Remember that, as discussed previously, N_s should be at least of the order of 500 for the results to be meaningful, and larger values of N_s are preferable, which makes this algorithm of no practical use in many cases.
From a practical perspective, besides the issue of time, care must be taken in the implementation when automating Algorithm 4. In fact, if a script trains hundreds of models, it may happen that some will not converge. The results of these models will, therefore, be quite different from all the others, and this may skew the distribution of the estimator. It is thus necessary to check that all the trained models reach approximately the same value of the loss function. Models that do not converge should be excluded from the analysis, as they will clearly distort the results.
It is important to note that the estimate of the distribution of an averaging estimator such as the MSE will always depend on the data used. Therefore, the method allows one to assess the performance of an NNM, measured as its generalisation ability when applied to unseen data.

Bootstrap
This section describes the application of bootstrap to estimate the distribution of a statistical estimator. Let's suppose one has an NNM trained on a given training dataset S T and is interested in finding an estimate of the distribution of a statistical estimator, for example, the MSE or the accuracy. In this case, one can apply bootstrap, as described in Algorithm 1, to the validation dataset. The steps necessary are highlighted in pseudo-code in Algorithm 5.
Note that Algorithm 5 does not require training an NNM multiple times and is, therefore, quite time-efficient. From a practical perspective, it is important to note that the results of Algorithm 5 (θ̂_n and σ(θ̂_n)) approximate those of Algorithm 4 (θ̂_V and σ(θ̂_V)). In fact, the main difference between the algorithms is that in Algorithm 4, an NNM is trained each time on new data, while in Algorithm 5, the training is performed only once. Assuming that the dataset is big enough and that the trained NNMs converge to similar minima of the loss function, it is reasonable to expect that their results will be comparable.
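The resampling part of Algorithm 5 can be sketched as follows; a trivial mean predictor stands in for the single NNM training run, and the toy dataset is an illustrative assumption.

```python
import random
import statistics

def bootstrap_validation(model, S_V, theta, n_samples=1800, seed=0):
    """Bootstrap on the validation set: the model is already trained; only
    S_V is resampled with replacement, n_samples times."""
    rng = random.Random(seed)
    n = len(S_V)
    estimates = [theta(model, [rng.choice(S_V) for _ in range(n)])
                 for _ in range(n_samples)]
    return statistics.fmean(estimates), statistics.stdev(estimates)

# One single (placeholder) training run, then resampling only:
mse = lambda model, S: statistics.fmean((model - y) ** 2 for _, y in S)
rng = random.Random(3)
S = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (rng.random() for _ in range(500))]
S_T, S_V = S[:400], S[400:]
model = statistics.fmean(y for _, y in S_T)      # "trained" once
avg_mse, sd_mse = bootstrap_validation(model, S_V, mse)
print(round(avg_mse, 3), round(sd_mse, 3))
```

The whole cost after training is a resampling loop over S_V, which is why this version remains fast even for large models.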

Mixed Approach between Bootstrap and Split/Train
The bootstrap approach, as described in the previous section, is computationally extremely attractive, but has one major drawback that needs further discussion. Similarly to the hold-out technique, the estimates of the average of the MSE and of its variance are strongly influenced by the split: if S_V is not representative of the dataset (see the age example in Section 1), Algorithm 5 will give the impression of bad performance of the NNM.

Algorithm 5: Bootstrap algorithm applied to the estimation of the distribution of a statistical estimator.
Result: Average and standard deviation of a statistical estimator: θ̂_n and σ(θ̂_n)
1 Define an NNM by fixing the hyperparameters;
2 Create S_T from S by choosing m elements randomly from S, with m < n, without replacement;
3 S_V ← S \ S_T;
4 Train the NNM on S_T;
5 for i = 1, . . . , N_s do
6   Generate a new validation dataset S*_{V,i} by selecting |S_V| elements from S_V with replacement;
7   Evaluate the statistical estimator θ̂_i on S*_{V,i};
8 end
9 Evaluate the average and standard deviation of the θ̂_i;

A strategy to avoid this drawback is to run Algorithm 5 on the data a few times using different splits. As a rule of thumb, when the averages of the MSE and its variances obtained with the different splits are comparable, they will likely be due to the NNM and not to the particular splits considered. Normally, considering 5 to 10 splits will give an indication of whether the results can be used as intended. This approach has the advantage of being able to use a large number of samples (the number of bootstrap samples) to estimate a statistical estimator, while remaining sensitive to possible problematic cases due to splits where the training and validation parts are not representative of each other and of the entire dataset.
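The mixed approach can be sketched as a handful of splits, each with one (stand-in) training run followed by a bootstrap loop; all names and toy data are illustrative.

```python
import random
import statistics

def mixed_estimate(S, fit, theta, n_splits=5, n_boot=500, seed=0):
    """Mixed approach sketch: a few split/train steps; after each single
    training, the estimator distribution is obtained by bootstrap on S_V."""
    rng = random.Random(seed)
    per_split = []
    for _ in range(n_splits):
        data = S[:]
        rng.shuffle(data)
        cut = int(0.8 * len(data))
        S_T, S_V = data[:cut], data[cut:]
        model = fit(S_T)                           # trained once per split
        boot = [theta(model, [rng.choice(S_V) for _ in range(len(S_V))])
                for _ in range(n_boot)]
        per_split.append((statistics.fmean(boot), statistics.stdev(boot)))
    return per_split

fit_mean = lambda S_T: statistics.fmean(y for _, y in S_T)   # placeholder for NNM
mse = lambda model, S: statistics.fmean((model - y) ** 2 for _, y in S)

rng = random.Random(11)
S = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in (rng.random() for _ in range(500))]
results = mixed_estimate(S, fit_mean, mse)
# Comparable averages across splits suggest the estimate reflects the model,
# not an unlucky split.
for avg, sd in results:
    print(round(avg, 3), round(sd, 3))
```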

Application to Synthetic Data
To illustrate the application of bootstrap and to show its potential compared to the split/train approach, let's consider a regression problem. A synthetic dataset (x_i, y_i), with i = 1, . . . , n, was generated with Algorithm 6. All the simulations in this paper were done with n = 500. The data correspond to a quadratic polynomial to which random noise taken from a uniform distribution was added. The goal in this problem is to extract the underlying structure (the polynomial function) from the noise. A simple NNM was used to predict y_i for each x_i. The NNM consists of a neural network with two layers, each having four neurons with sigmoid activation functions, trained for 250 epochs with a mini-batch size of 16 and the Adam optimizer [63]. Classification problems can be treated similarly and will not be discussed explicitly in this tutorial.
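A possible data-generation sketch in the spirit of Algorithm 6 is shown below; the polynomial coefficients, the x-range, and the noise amplitude are illustrative assumptions, not the exact values used in the paper.

```python
import random

def make_synthetic(n=500, coeffs=(1.0, -2.0, 3.0), noise=0.5, seed=0):
    """Quadratic polynomial plus uniform noise: y = a*x^2 + b*x + c + u,
    with u drawn uniformly from [-noise, noise]. All constants are
    illustrative placeholders."""
    rng = random.Random(seed)
    a, b, c = coeffs
    X = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [a * x ** 2 + b * x + c + rng.uniform(-noise, noise) for x in X]
    return X, y

X, y = make_synthetic()
print(len(X), len(y))
```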

Results of Bootstrap
After generating the synthetic data, the dataset was split into 80%, used as the training dataset (S_T), and 20%, used for validation (S_V). Then, the bootstrap Algorithm 5 was applied, training the NNM on S_T and generating 1800 bootstrap samples from the S_V dataset. Finally, the MSE metric θ̂_n was evaluated on the bootstrap samples, and its distribution plotted in Figure 2. The black line is the distribution of the MSE on the 1800 bootstrap samples, while the red line is a Gaussian function with the same mean and standard deviation as the evaluated MSE values. Figure 2 shows that, as expected from Corollary 1, the distribution of the MSE values has a Gaussian shape. This justifies, as discussed, the use of the average and standard deviation of the MSE values to completely describe their distribution. N_s = 1800 was used for both Algorithms 4 and 5. The corresponding averages of the metric distributions (vertical dashed lines) are very similar for the shown cases. Note that, depending on the split, the NNM may learn better or worse and, therefore, the average of the MSE obtained by using Algorithm 5 may vary, although most of the cases will still lie within roughly one σ of the average obtained by Algorithm 4. The second observation is that, although the average of the MSE may vary a bit, its variance stays quite constant. Finally, the results show that the distributions have a Gaussian shape, as expected from Corollary 1. As mentioned, Algorithm 5 is computationally more efficient: for example, on a system with an Intel 2.3 GHz 8-Core i9 CPU, Algorithm 5 took less than a minute, while Algorithm 4 took over an hour.

Comparison of Split/Train and Bootstrap Algorithms
The comparison between the split/train and bootstrap algorithms is summarized in Table 1, which lists the average of the MSE, its variance, and the running time. The table also reports the results of k-fold cross-validation and of the mixed approach (a few split/train steps combined with many bootstrap samples). Note that, for the mixed approach, the reported average of the MSE and σ are obtained over the several splits.
It is important to note that, in the split/train algorithm, the NNM was trained on 100 different splits, while in the bootstrap algorithm it was trained only once. To avoid the dependence of the average of the MSE and of its variance on a single split, the mixed approach offers the possibility to check for dependencies on the split while remaining computationally efficient. For comparison, k-fold CV is computationally efficient, but the distribution of the MSE is composed of very few (here k = 5) values. Thus, even if the results in Table 1 are numerically similar, the differences lie in the computation time and in the robustness of these values.
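The mixed approach (a few split/train steps, each followed by many bootstrap samples of the validation metric) can be sketched as follows. Since actual NNM training is expensive, a quadratic least-squares fit stands in for it here; all names and parameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mixed_bootstrap(x, y, n_splits=5, n_boot=1800, seed=0):
    """Sketch of the mixed approach: for each of a few random 80/20
    splits, "train" a model on the training part (a quadratic fit stands
    in for the NNM) and draw many bootstrap MSE samples from the
    validation part.  Returns all n_splits * n_boot MSE values."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    all_mses = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        cut = int(0.8 * n)
        tr, va = idx[:cut], idx[cut:]
        coeffs = np.polyfit(x[tr], y[tr], deg=2)  # stand-in for NNM training
        sq_err = (np.polyval(coeffs, x[va]) - y[va]) ** 2
        for _ in range(n_boot):
            b = rng.integers(0, len(va), size=len(va))
            all_mses.append(sq_err[b].mean())
    return np.array(all_mses)
```

The overall mean and σ of the returned values correspond to the quantities reported for the mixed approach in Table 1, at the cost of only n_splits trainings instead of one per sample.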

Application to Real Data
To test the different approaches on real data, the Boston dataset [64] was chosen. This dataset contains house prices collected by the US Census Service in the area of Boston, US. Each observation has 13 features, and the target variable is the median price of a house with those features. The interested reader can find the details in [65].
The results are summarized in Table 2. As visible from Table 2, the average of the MSE and its variance obtained with the different algorithms are numerically similar, as was the case for the synthetic dataset. Here, as in the example of Section 7.2, the difference again lies in the computation time and in the robustness of these values.

Limitations and Promising Research Developments
It is important to note that the algorithms discussed in this work are presented as practical implementation techniques, but are not backed by theoretical mathematical proofs, with the exception of Section 3. One of the main obstacles to proving such results is the intractability of neural networks (see, for example, [66]) due to their complexity and nonlinearity.
Although not yet backed by mathematical proofs, these approaches are important tools for estimating the value of a statistical estimator without running into problems such as the one described in the introduction with the age feature.
The field offers several promising research directions. One interesting question is whether the degradation of NNM performance due to small datasets differs across the different techniques. This would make it possible to choose which technique works better with less data. Another promising direction is the study of different network topologies and of the role of the architecture in the resampling results. It is not at all obvious that different network architectures behave the same under different resampling techniques. Finally, it would be important to support the simulation results described in this paper with mathematical proofs. This, in the authors' opinion, would be one of the most important research directions to pursue in the future.

Conclusions
This tutorial showed how the distribution of an average-based estimator, such as the MSE or the accuracy, tends asymptotically to a Gaussian shape. The estimation of the average and variance of such an estimator, the only two parameters needed to describe its distribution, is therefore of great importance when working with NNMs. These two quantities allow one to assess the performance of an NNM, understood as its ability to generalise when applied to unseen data.
Classical resampling techniques were explained and discussed, with a focus on their application to NNMs: the hold-out set approach, leave-one-out cross-validation, k-fold cross-validation, jackknife, bootstrap, and split/train. The pseudo-code included is meant to facilitate their implementation. The relevant practical aspects, such as the computation time, were discussed. The application and performance of the bootstrap and split/train algorithms were demonstrated with the help of synthetically generated and real data.
The mixed bootstrap algorithm was proposed as a technique to obtain reasonable estimates of the distribution of statistical estimators in a computationally efficient way. The results are comparable with those obtained with the much more computationally intensive algorithms, such as the split/train one.

Software
The code used in this tutorial for the simulations is available at [67].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

MSE     Mean Square Error
NNM     Neural Network Model
CLT     Central Limit Theorem
iid     independent identically distributed
PDF     Probability Density Function
LOOCV   Leave-one-out cross-validation
CV      cross-validation