Performance Analysis of Multi-Task Deep Learning Models for Flux Regression in Discrete Fracture Networks

In this work, we investigate the sensitivity of a family of multi-task Deep Neural Networks (DNN) trained to predict fluxes through given Discrete Fracture Networks (DFNs), stochastically varying the fracture transmissivities. In particular, detailed performance and reliability analyses of more than two hundred Neural Networks (NN) are performed, training the models on sets of an increasing number of numerical simulations made on several DFNs with two fixed geometries (158 fractures and 395 fractures) and different transmissivity configurations. A quantitative evaluation of the trained NN predictions is proposed, and rules fitting the observed behavior are provided to predict the number of training simulations that are required for a given accuracy with respect to the variability in the stochastic distribution of the fracture transmissivities. A rule for estimating the cardinality of the training dataset for different configurations is proposed. From the analysis performed, an interesting regularity of the NN behaviors is observed, despite the stochasticity that imbues the whole training process. The proposed approach can be relevant for the use of deep learning models as model reduction methods in the framework of uncertainty quantification analysis for fracture networks and can be extended to similar geological problems (for example, to the more complex discrete fracture matrix models). The results of this study have the potential to grant concrete advantages to real underground flow characterization problems, making computational costs less expensive through the use of NNs.


Introduction
Analysis of underground flows in fractured media is relevant in several engineering fields, e.g., in oil and gas extraction, in geothermal energy production, or in the prevention of geological or water-pollution risk, to mention a few. Many possible approaches exist for modeling fractured media, and among the most used is the Discrete Fracture Network (DFN) model [1][2][3]. In this model, fractures in the rock matrix are represented as planar polygons in a three-dimensional domain that intersect each other; through the intersection segments (called "traces"), a flux exchange between fractures occurs, while the 3D domain representing the surrounding rock matrix is assumed to be impermeable. On each fracture, the Darcy law is assumed to characterize the flux, while head continuity and flux balance are imposed at all traces.
Underground flow simulations using DFNs can be, however, a quite challenging problem in the case of realistic networks, where the computational domain is often characterized by a high geometrical complexity; in particular, fractures and traces can intersect, forming very narrow angles, or can be very close to each other. These complex geometrical characteristics make the creation of the mesh a difficult task, especially for numerical methods requiring conforming meshes. Therefore, new methods using different strategies have been proposed in the literature to avoid these meshing problems. In particular, in [4][5][6][7], the mortar method is used, eventually together with geometry modifications, while in [8][9][10], lower-dimensional problems are introduced in order to reduce the complexity. Alternatively, a new method that allows the meshing process to be considered an easy task was illustrated in [11][12][13][14][15][16][17]; in this case, the problem was reformulated as a Partial Differential Equation (PDE) constrained optimization one; thanks to this reformulation, totally non-conforming meshes are allowed on different fractures and the meshing process can be independently performed on each fracture. The simulations used in this study are performed with this approach. Other approaches can be found in [18][19][20][21].
In real-world problems, full deterministic knowledge of the hydrogeological and geometric properties of an underground network of fractures is rarely available. Therefore, these characteristics are often described through probability distributions, inferred by geological analyses of the basin [22][23][24][25]. This uncertainty about the subsurface network of fractures implies a stochastic creation of DFNs, sampling the geometric features (position, size, orientation, etc.) and hydrogeological features from the given distributions; then, the flux and transport phenomena are analyzed from a statistical point of view. For this reason, Uncertainty Quantification (UQ) analyses are required to compute the expected values and variances (i.e., the moments) of the Quantities of Interest (QoI), e.g., the flux flowing through a section of the network. However, UQ analyses typically involve thousands of DFN simulations to obtain trustworthy values of the QoI moments [26,27], and each simulation may have a relevant computational cost (both in terms of time and memory). It is therefore worth considering complexity-reduction techniques to speed up the statistical analyses, such as the multi-fidelity approach [28] or graph-based reduction techniques [29].
Machine Learning (ML) techniques, and in particular Neural Networks (NNs), have in recent years proven to be potentially useful instruments for complexity reduction due to their negligible computational cost in making predictions. Some recent contributions involving ML and NNs applied to DFN flow simulations or UQ analysis are proposed in [30][31][32][33][34][35]. To the best of the authors' knowledge, other than [35][36][37], there are no works in the literature that involve the use of NNs as a model reduction method for DFN simulations. In particular, in [35], multi-task deep neural networks are trained to predict the fluxes of DFNs with fixed geometry, given the fracture transmissivities. A well-trained Deep Learning (DL) model can predict the results of thousands of DFN simulations in the order of a second and, therefore, lets a user estimate the entire distribution (not only the moments) of the chosen QoI; the simulations that must be run to generate the training data are the actual computational cost. The results of [35] showed not only that NNs can be useful tools for predicting the flux values in a UQ framework but also that the quality of the flux approximation is very sensitive to some training hyperparameters. In particular, a strong dependence of the performance was observed when the training set size varied.
In this paper, we investigate in depth how the performance of a trained NN, and the size of the training set required for good flux prediction, depend on the variance of the stochastic parameter of the fracture transmissivities. When variability in the phenomenon increases, good training of an NN requires more and more data. If the data are generated by numerical models, a large number of simulations is necessary for the creation of the dataset involved in training the NN. It can then be useful to have a tool that provides an estimate of NN performance for different amounts of training data and for different values of the variance of the stochastic input parameters. This issue is relevant to predict the convenience of the approach in real-world applicative situations. Indeed, we recall that the DFN simulations required to generate the training data are the only nonnegligible cost for training NNs on these problems. Therefore, it is important to provide a rule that estimates the number of simulations needed to train an NN model with good performances: if the number of simulations for NN training is less than the number required by a standard UQ analysis, it is convenient to use an NN reduced model; otherwise, other approaches can be considered.
In this work, we take into account the same flux regression problem described in [35] and we explicitly analyze the performance behavior of the NNs trained for DFN flux regression. The analysis is applied to a pair of DFN geometries and to multiple NNs with different architectures and training configurations, showing interesting behaviors that let us characterize the relationship between the number of training data, the transmissivity standard deviation, and the NN performances. From this relationship, we determine a "UQ rule" that provides an estimate of the minimum number of simulations required for training an NN with a flux approximation error less than an arbitrary ε > 0. The rule is validated on a third DFN, proving concrete efficiency and applicability to real-world problems.
The paper is organized as follows. In Section 2, we start with a brief description of the DFN numerical models and their characterization for the analyses of this work; then, we continue with a short introduction on the framework of NNs in order to better describe the concepts discussed in Section 2.2.3 that concerns the application of deep learning models for flux regression in DFNs. In Section 2.3, the performance analysis procedure used in this work is described step by step. In Section 3, we show the application of the analysis described in the previous section and the results obtained for the two test cases considered; in particular, here, we introduce interesting rules that characterize the error behaviors and that are useful for estimating the minimum number of data required for good NN training. We conclude the work with Sections 4 and 5, where the main aspects of the obtained results are commented upon and discussed.

Discrete Fracture Networks
We recall here, for the reader's convenience, the model problem of Discrete Fracture Networks (DFNs). The model is described briefly, and for full details, we point the interested reader to [3,10,11,14]. After the model description, we also introduce the main characteristics of the DFNs used for the performance analysis of Neural Networks (NNs) trained for flux regression.

Numerical Model and Numerical Solution
A DFN is a discrete model that describes an underground network of fractures in a rock medium. In a DFN, the network of fractures is represented as a set of intersecting two-dimensional polygons in a three-dimensional space (see Figure 1). Each one of the polygons that stands for a network fracture is labeled with an index in a set I, and each fracture is denoted by F_i, with i ∈ I; then, a DFN is given by the union ∪_{i∈I} F_i of all the fractures. The intersections between fractures of the DFN are called traces, and through them, the flow crosses the network.
The flow model we assumed in the network was a simple Darcy model, and the DFN numerical simulation consisted in finding for each i ∈ I the hydraulic head H i of fracture F i . In our simulations, the flow was induced by a fixed pressure difference ∆H between two arbitrary surfaces of the 3D domain of the DFN. In order to solve the problem, some matching conditions were imposed at each trace of the network: we assumed the hydraulic head continuity and the flux balance. In this work, the DFN numerical simulations were computed following the approach described in [11,12], reformulating the problem as a PDE-constrained optimization one and using the finite elements as the space discretization method.
The flow simulation in a DFN was characterized by the geometry and by hydrogeological properties. In particular, for each i ∈ I, the transmissivity parameter κ i of the fracture F i characterized the flow facilitation through the fracture. In the most general case, the fracture transmissivity κ i can be a function of the fracture points, but in this work, we considered it as a constant parameter for each fracture F i .
In this work, we trained the NN regression models to predict the fluxes exiting from the DFN given the set of transmissivities of its fractures for a fixed geometry and ∆H.

DFN Characterization
In the problem addressed in this work, we considered two DFN geometries and the transmissivities were modelled as random variables with a known distribution.
In particular, each DFN considered in this paper was characterized by the following properties:
• A fixed geometry with n ∈ N fractures F_1, . . . , F_n in a cubic domain D = [0, ℓ]^3 ⊂ R^3 with edge length ℓ = 1000 m. The fractures were assumed to be octagons and were randomly generated as in [22,23], i.e., with the following geometrical features:
  - The fracture radii were sampled from a truncated power law distribution with exponent γ = 2.5, upper cutoff r_u = 560, and lower cutoff r_0 = 50.
  - The mass centers of the fractures were sampled from a uniform distribution in D.
• Transmissivities κ_1, . . . , κ_n ∈ R of the n fractures were assumed to be isotropic parameters, modeled as random variables with a log-normal distribution [24,25], i.e., such that

  log_10(κ_i) ∼ N(−5, σ), for each i = 1, . . . , n,    (1)

where σ ∈ R_{≥0} is an arbitrary parameter that characterizes the standard deviation of the transmissivity distribution.
• Due to the fixed geometry of the DFN, the fracture positions in space did not change. Then, given the fixed pressure difference ∆H, the boundary fractures with flux entering and exiting the network were the ones intersecting the plane x = 0 and the plane x = ℓ, respectively, independently of the transmissivity values. We denoted with m ∈ N the number of boundary fractures with exiting flux.
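As an illustration, the transmissivity sampling described by (1) can be sketched in a few lines. This is a minimal sketch: the mean μ = −5 of the log-transmissivities and the log-normal model come from the text, while the function name and the use of NumPy are our own choices.

```python
import numpy as np

def sample_transmissivities(n, sigma, mu=-5.0, rng=None):
    """Sample n isotropic fracture transmissivities kappa_i such that
    log10(kappa_i) ~ N(mu, sigma), as in the DFN characterization (1)."""
    rng = np.random.default_rng() if rng is None else rng
    log10_kappa = rng.normal(loc=mu, scale=sigma, size=n)
    return 10.0 ** log10_kappa

# One transmissivity configuration for a 158-fracture geometry:
kappa = sample_transmissivities(n=158, sigma=1 / 3, rng=np.random.default_rng(0))
```

Each call produces one input vector κ for a single DFN flow simulation; repeating the call yields the stochastic dataset over which the NNs are trained.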

Neural Networks
Neural Networks are powerful ML models that were first introduced more than fifty years ago (see, e.g., [38][39][40]). In the last decade, the usage of NNs has been characterized by incredible growth, thanks to new and more powerful computer hardware and the increase in available data (see Chapter 1 in [41]).
In this section, we recap briefly the main properties of NNs, and in Section 2.2.3, we describe the multi-task architecture adopted for the flux regression problem in DFNs.

General Concepts about Neural Networks and Learning Models
An NN, similar to most other ML models, is a parametric model F_w : R^n → R^m, where the vector of parameters (also called weights) w is adjusted through an error-minimization process in order to approximate a fixed target function F : Ω ⊆ R^n → R^m, i.e., looking for a final vector of parameters w_fin such that F_{w_fin}(x) ≈ F(x) for each x ∈ Ω. In typical ML problems, the function F is unknown or requires high computational efforts to be evaluated, justifying the need to find a computationally cheap approximation of it. The parameters w_fin are computed starting from a dataset of pairs

  D = {(x_d, y_d = F(x_d)) ∈ Ω × R^m, d = 1, . . . , D},    (2)

where D ∈ N is the dataset cardinality and the pairs are obtained from observation of real data (when F is unknown) or built by randomly sampling D elements x_d from the domain Ω and evaluating y_d when F is known but computationally expensive. The search for an optimal weight vector, such that F is approximated by F_w, is based on a subset T ⊆ D called the training set. The idea is to find a weight vector that minimizes an arbitrary error measure, e.g., the square of the Euclidean norm ||F_w(x) − y||^2 for each pair (x, F(x) = y) ∈ Ω × R^m; to do so, in NNs, we typically solve a stochastic optimization problem of the form

  min_w L(F_w, B),    (3)

where the subset B, also called a minibatch, is a set of B ∈ N pairs randomly sampled from T and L(F_w, B) is the loss function on B, e.g., the sum of Mean Square Errors (sMSE) over the pairs of the minibatch:

  L(F_w, B) = Σ_{(x,y)∈B} (1/m) Σ_{j=1}^{m} (ŷ_j − y_j)^2,    (4)

where ŷ := F_w(x) for each x ∈ R^n, and ŷ_j and y_j are the jth components of ŷ and y, respectively, for each j = 1, . . . , m (vectors are denoted by boldface symbols). Problem (3) is solved with arbitrary variants of the stochastic gradient descent method (see Chapter 8.3 in [41]), i.e., variants of the iterative gradient method that perform a minimization step on a random minibatch B ⊂ T of fixed size B at each iteration.
In particular, the sampling of B at each step occurs without repetition until all the pairs (x, y) ∈ T have been extracted; once the pairs (x, y) ∈ T are exhausted, an epoch of the training is completed and a new one starts, until a stopping criterion is met. An NN model can be trained for many epochs to improve the approximation of F.
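The minibatch mechanics described above (sampling without repetition until the epoch is complete, with an sMSE-style loss) can be sketched as follows. This is an illustrative fragment, not the training code used in the paper; the function names are our own, and the gradient step itself is omitted.

```python
import numpy as np

def smse(y_pred, y_true):
    """sMSE-style loss: sum over the minibatch pairs of the mean squared
    error over the m output components."""
    return float(np.sum(np.mean((y_pred - y_true) ** 2, axis=1)))

def minibatches(X, Y, batch_size, rng):
    """Yield minibatches sampled without repetition; one full pass over
    the training pairs is one epoch."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], Y[sel]
```

An optimizer would call `smse` on each yielded minibatch and update the weights; iterating `minibatches` repeatedly corresponds to training for several epochs.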
The remaining pairs of data, i.e., the pairs (x, y) ∈ D \ T, are used for monitoring and evaluating the quality of the NN training. Usually, the set D is split into three disjoint subsets: the training set T (already introduced above), the validation set V, and the test set P. The validation set V is often the smallest of the three subsets, and it is used to measure the approximation error of F_w at the end of each training epoch; the test set P, on the contrary, is often bigger than V (and sometimes also bigger than T), and it is used to measure the approximation quality of the trained NN on new data that the model never used during the training phase. Then, P is useful to understand if the NN is a good approximation of the target function for new input data, whereas V is useful to monitor the training activity and to define proper stopping criteria for avoiding underfitting/overfitting problems (see Chapter 5.2 in [41]).

Neural Network Structure
The structure of an NN is what actually makes it different from other ML models. An NN is a learning algorithm that can be described through an oriented weighted graph (U, A), where U is the set of nodes and A is the set of edges, i.e., the set of ordered pairs of nodes a_ij = (u_i, u_j) ∈ A ⊆ U × U, each one endowed with a weight w_ij ∈ R.
Each node of an NN, also called a unit or a neuron, can receive signals either from other nodes (through the edges) or from one external source; in the latter case, the unit is defined as an input unit of the NN, and it returns as output the same signal received as input. Each non-input unit u_j ∈ U performs arbitrary fixed operations on the input signals x_{i_1}, . . . , x_{i_N} (sent by units u_{i_1}, . . . , u_{i_N} ∈ U) and returns an output signal x_j. Typically, a unit is characterized as in Figure 2, i.e., with an output signal x_j such that

  x_j = f( Σ_{k=1}^{N} w_{i_k j} x_{i_k} ),    (5)

where f is an arbitrary function called the activation function. If the output x_j of u_j is sent only to other units of the NN, then u_j is defined as a hidden unit; otherwise, if the signal x_j is sent "outside" as a general output of the NN model, then u_j is defined as an output unit.
One special type of hidden unit is the bias unit, characterized by a fixed unitary output. For this reason, Equation (5) can be rewritten with a different notation that highlights the action of the bias unit:

  x_j = f( Σ_{k=1}^{N} w_{i_k j} x_{i_k} + b_j ),    (6)

where b_j is the weight of the edge (u_bias, u_j) and u_bias ∈ U is a bias unit connected to u_j. Even if it is not forbidden, in practice, no more than one bias unit is connected to each hidden or output unit.

We conclude this brief description of the NN structure by introducing the concept of "layers" for NNs. NN architectures organize their units in subsets that interact with each other; these subsets are called layers and can be divided into three main types (as the units): the input layers, the hidden layers, and the output layers.
The simplest type of hidden and output layers is the so-called fully connected layer. A fully connected layer L that receives signals from a layer I is characterized by the fact that each unit in L is connected to all the units in I and that all the units in L have the same activation function f; then, assuming that all the units in L are characterized by (6), the layer action on the output signals from I can be described as the function L such that

  x^(L) = f( W^T x^(I) + b ),    (7)

where:
• x^(L) ∈ R^M and x^(I) ∈ R^N are the vectors of the output signals of L and I, respectively;
• f : R^M → R^M is the element-wise application of the activation function f of the layer units;
• W ∈ R^{N×M} is the weight matrix with entries w_ij, corresponding to the edges that connect unit u_i of I to unit u_j of L; and
• b ∈ R^M is the vector of bias weights for the M units of L.
See Figure 3 for a graphical representation of a fully connected layer.
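A fully connected layer of this kind reduces to a one-line computation. The following is a minimal NumPy sketch (function names are ours), using the softplus activation that appears later in the adopted architecture.

```python
import numpy as np

def softplus(x):
    """Softplus activation f(x) = log(1 + e^x)."""
    return np.log1p(np.exp(x))

def dense_layer(x_in, W, b, f=softplus):
    """Fully connected layer action x_out = f(W^T x_in + b), with
    x_in in R^N, W in R^{N x M}, and b in R^M."""
    return f(W.T @ x_in + b)
```

With zero weights and biases, every unit outputs softplus(0) = log 2, which is a convenient sanity check of the shapes and the activation.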
The layer formulation in NNs is extremely useful for describing the function F_w as a composition of layer functions and for NN implementation in computer programs. The representation of F_w as a composition of functions is used to build a computational graph that is extremely useful in speeding up the computation of both the NN outputs and the gradient of the loss function during the training phase. For more details about these properties, see Chapter 6 in [41].

Deep Learning Models for Flux Regression in DFNs
Let us consider a DFN with a geometry defined as in Section 2.1.2, where n is the number of fractures and m is the number of boundary fractures with exiting flux; we recall that the case has a fixed geometry and that only the fracture transmissivities change (see (1)). For each vector of transmissivity samples κ = [κ_1, . . . , κ_n], we can run a flow simulation for the given DFN and compute the m fluxes ϕ = [ϕ_1, . . . , ϕ_m] of the m boundary outflowing-flux fractures. Let F : R^n → R^m be the function related to the given DFN such that

  F(κ) = ϕ,    (8)

for each κ ∈ R^n sampled following (1); then, we want to define a multi-task architecture for Deep Neural Network (DNN) models able to approximate F, i.e., able to predict the corresponding fluxes ϕ ∈ R^m for each transmissivity vector κ ∈ R^n in input. The DFN flow simulations are performed using the GEO++ software [42], based on a PDE-constrained reformulation of the problem, using finite elements for the spatial discretization. For further details, we point the interested reader to [11,14,42].
For the DNN models of this work, we adopt a multi-task architecture (see Chapter 7.7 in [41]) with the same structure described in [35]. Given a fixed hyperparameter α ∈ N and the target function F, we define the DNN multi-task architecture A_α (see Figure 4) such that:
• A_α has one input layer L_0 of n units;
• the input layer L_0 is fully connected to the first layer L_1 of a sequence of α fully connected layers (L_1, . . . , L_α), and all the layers of the sequence have n units characterized by the softplus activation function (i.e., f(x) = log(1 + e^x)).
We can easily see that the characteristic function of a DNN model with architecture A_α accepts n inputs and returns m outputs; then, it is a function F_w : R^n → R^m, and the NN can be trained to approximate the function F.
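A forward pass through such a stack can be sketched as below. This is only an illustration under a stated assumption: the text fully specifies the input layer and the α hidden softplus layers of n units, while the output stage is only stated to map to m values, so the single linear output layer used here is our simplification of the multi-task output structure, not the paper's exact architecture.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def forward_A_alpha(kappa, hidden_weights, W_out, b_out):
    """Sketch of a forward pass through an A_alpha-like stack:
    hidden_weights is a list of alpha pairs (W, b) with W in R^{n x n},
    each followed by the softplus activation; the final linear map to
    R^m is an illustrative assumption."""
    x = kappa
    for W, b in hidden_weights:
        x = softplus(W.T @ x + b)
    return W_out.T @ x + b_out

# Illustrative usage with random weights (n = 4 inputs, m = 2 outputs, alpha = 3):
rng = np.random.default_rng(0)
n, m, alpha = 4, 2, 3
hidden = [(0.1 * rng.normal(size=(n, n)), np.zeros(n)) for _ in range(alpha)]
phi_hat = forward_A_alpha(np.ones(n), hidden, rng.normal(size=(n, m)), np.zeros(m))
```

The point of the sketch is only the shape of the computation: n transmissivities in, α softplus layers of width n, m flux predictions out.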

In order to evaluate the quality of the approximation, for each outflow fracture, we consider an error measure that evaluates the relative distance between the flux estimated by the NN and the real flux. Let us assume that we have a trained DNN with function F_w approximating F; then, the chosen error measure is the following:
• relative error: for each κ ∈ R^n, we can measure the vector of prediction errors normalized by the actual total exiting flux, i.e., the vector

  e(κ) = |ϕ − ϕ̂| / Σ_{j=1}^{m} ϕ_j ∈ R^m,    (9)

where ϕ̂ := F_w(κ) = [ϕ̂_1, . . . , ϕ̂_m] is the flux prediction for κ and the absolute value is applied element-wise. Then, given the input κ, for each j = 1, . . . , m, the jth element of the error (9) tells us how much the prediction for the jth exiting flux differs from the original value, represented as a fraction of the total flux outflowing from the DFN.
The errors introduced above are very useful in studying the prediction ability of a flux regression NN. However, in order to directly compare the predictions of different NNs, we need to define some cumulative scalar values that summarize the approximation quality of the NN models. Then, we introduce a quantity that is obtained from the aggregation of (9) once an arbitrary test set P of pairs (κ, ϕ) is given. This quantity is the
• average mean relative error:

  E(P) = (1/|P|) Σ_{(κ,ϕ)∈P} (1/m) Σ_{j=1}^{m} e_j(κ),    (10)

where e_j(κ) is the jth element of the vector e(κ). For simplicity, from now on, we will call the quantity E(P) simply the average error instead of the average mean relative error.
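Both error measures are straightforward to compute; a minimal NumPy sketch (with our own function names) follows.

```python
import numpy as np

def relative_error(phi, phi_hat):
    """Per-fracture absolute prediction error normalized by the total
    exiting flux (the relative-error vector described in the text)."""
    return np.abs(phi - phi_hat) / np.sum(phi)

def average_error(Phi_true, Phi_pred):
    """Average error E(P): mean over the test pairs of the mean over the
    m outflow fractures of the relative errors.  Rows are test pairs,
    columns are the m outflow fractures."""
    e = np.abs(Phi_true - Phi_pred) / np.sum(Phi_true, axis=1, keepdims=True)
    return float(np.mean(e))
```

For instance, true fluxes [2, 2] predicted as [1, 3] give a relative-error vector [0.25, 0.25]: each component is off by 1 out of a total exiting flux of 4.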

Performance Analysis of Deep Learning Models for Flux Regression
In Sections 2.1 and 2.2, we introduced all the notions needed to analyze the performances of multi-task Deep Learning (DL) models trained to predict the fluxes exiting from a DFN. In [35], the NN sensitivity to the cardinality of the training set T was noticed, but in those cases, a fixed value σ = 1/3 for (1) was chosen. In this work, we deeply investigate this sensitivity, quantifying it through the measurement of average errors varying the available number of training data and the sparsity of the data itself.
The analysis performed in this paper compares the average errors E(P) of a set of NNs with a common architecture A_α and the same configuration of training hyperparameters and functions but trained on different training data; in particular, the differences between the training data are characterized by two hyperparameters:
• the parameter σ ∈ R_{≥0}, characterizing the standard deviation of the transmissivity distribution (see (1)). This parameter varies in a discrete and finite set of values Σ, arbitrarily chosen;
• the number ϑ ∈ N of data (κ, ϕ) used for training the NN, i.e., the sum of the training set and validation set cardinalities:

  ϑ = |T| + |V|.    (11)

Similar to σ, this parameter varies in a discrete and finite set of values Θ, arbitrarily chosen.
In other words, the analysis procedure consists of training (|Σ| · |Θ|) times a fixed untrained NN, each time with respect to a set of training data characterized by a different combination of hyperparameters (σ, ϑ) ∈ Σ × Θ (i.e., with a different size ϑ and a different sparsity, dependent on σ); then, the performances (average errors) of all the (|Σ| · |Θ|) trained NNs are measured on test sets of the same size and compared, searching for the behavior of the average errors with respect to the values of σ and ϑ.

Performance Analysis: Method Description
Let us consider a DFN of the type described in Section 2.1.2. Let Σ ⊂ R_{≥0} be the set of values for the distribution parameter σ that we want to consider for the NN performance analysis; similarly, let Θ ⊂ N be the set of values for the number of training data ϑ considered, and let ρ ∈ N be the chosen cardinality of the test set P used for measuring the average errors and the mean divergences. The method that we use in this work in order to measure and compare the NN performances in flux regression problems for DFNs is characterized by the following steps:

1. Select a DFN, generated as described in Section 2.1.2, with n fractures, m of which have exiting fluxes.
2. Define the arbitrary sets of values Σ and Θ for the hyperparameters that characterize the analysis.
3. Choose the cardinality ρ = |P| of the test set.
4. For each σ ∈ Σ, sample a sufficient number of transmissivity vectors κ according to the distribution (1).
5. For each σ ∈ Σ, run the DFN flow simulations for the sampled transmissivity vectors, building a dataset D_σ of pairs (κ, ϕ).
6. For each σ ∈ Σ, create a test set P_σ with a random sampling of ρ pairs from D_σ.
7. Choose a value α ∈ N, and build an untrained NN N with architecture A_α.
8. For each σ ∈ Σ and each ϑ ∈ Θ, randomly sample a number ϑ of pairs (κ, ϕ) from D_σ \ P_σ (i.e., the dataset without the test set). We decided to use 20% of the ϑ sampled pairs as the validation set V_σ^ϑ and the remaining 80% as the training set T_σ^ϑ.
9. For each pair (σ, ϑ) ∈ Σ × Θ, train the untrained NN N using the data of T_σ^ϑ and V_σ^ϑ, obtaining a trained NN N_σ^ϑ. In all cases, the training is characterized by the same arbitrarily chosen hyperparameters and functions.
10. For each pair (σ, ϑ) ∈ Σ × Θ, measure the quantity E_σ^ϑ, i.e., the average error E(P_σ) computed for the NN N_σ^ϑ.
11. Analyze the set of points

  E = {(σ, ϑ, E_σ^ϑ) : (σ, ϑ) ∈ Σ × Θ}    (16)

and find the best-fitting function E : R^2 → R with respect to the points in E, characterizing the average error as a function of the parameters σ and ϑ.
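The training-and-evaluation loop at the core of the method (one training per (σ, ϑ) combination, then one average error per trained NN) can be sketched as follows. The simulation- and NN-dependent parts are left as placeholder callables: `train_nn` and `average_error_on_test` are hypothetical names standing in for the pipeline described in the steps above, not part of the paper's code.

```python
import itertools

def performance_grid(Sigma, Theta, train_nn, average_error_on_test):
    """For every (sigma, theta) pair, train one NN and record its average
    error on the sigma-specific test set; returns the set of points E as
    a dictionary {(sigma, theta): E_sigma_theta}."""
    E = {}
    for sigma, theta in itertools.product(Sigma, Theta):
        model = train_nn(sigma, theta)  # 80/20 training/validation split inside
        E[(sigma, theta)] = average_error_on_test(model, sigma)
    return E
```

The resulting dictionary is exactly the set of points to which the best-fitting function E(σ, ϑ) is fitted in the last step.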

Training Hyperparameters and Functions
Step 9 of the method described above refers to a fixed and arbitrary configuration of the hyperparameters and functions of the training phase. In particular, in this work, we perform the analysis considering two cases with two fixed configurations that differ only in the minibatch size adopted: the first configuration is characterized by a minibatch size |B| = 10, while the second one has a minibatch size |B| = 30.
Let β be a parameter denoting which training configuration we choose; then, β = 1 represents the choice of the first configuration (|B| = 10) and β = 2 represents the choice of the second configuration (|B| = 30). All the other properties of the training configurations are the same for both the cases β = 1 and β = 2 and are as follows:
• input data preprocessing: the input data are transformed by applying first the function log10 (element-wise) and then the z-normalization [43];
• output data preprocessing: the output data are rescaled by a factor equal to 10^6;
• layer-weights initialization: Glorot;
• loss function: the sMSE (4).
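The two data-preprocessing steps shared by the configurations can be sketched as follows (a minimal illustration with our own function names; the per-feature z-normalization, i.e., subtracting the mean and dividing by the standard deviation, is the standard reading of [43]).

```python
import numpy as np

def preprocess_inputs(K):
    """Input preprocessing: element-wise log10 of the transmissivities,
    followed by z-normalization per input component."""
    L = np.log10(K)
    return (L - L.mean(axis=0)) / L.std(axis=0)

def preprocess_outputs(Phi):
    """Output preprocessing: rescale the fluxes by a factor of 10^6."""
    return Phi * 1e6
```

Working in log10 space is natural here, since the transmissivities are log-normally distributed by construction (1).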
The shared hyperparameters and functions chosen for the configurations β = 1, 2 consist mainly in the default options provided by most of the frameworks for NN implementation; indeed, in this analysis, we focus our attention on the effects of the parameters σ and ϑ on the NNs and, therefore, we choose standard training configurations that should grant a reasonable training quality.

Results
Here, we show the application of the performance analysis method described in Section 2.3 on two test cases. In particular, we consider two DFNs, DFN158 and DFN395, generated with respect to the characterization of Section 2.1.2; the total number of fractures n is equal to 158 and 395 for DFN158 and DFN395, respectively, and the number of outflow fractures m is equal to 7 and 13 for DFN158 and DFN395, respectively.
For each of these two DFNs, we train three different NNs with architectures A_α (see Section 2.2.3), one for each α = 1, 2, 3, and with respect to the two training configurations β = 1 and β = 2 (see Section 2.3.2); then, we have a total of six trained NNs, one for each (α, β) combination, for both DFN158 and DFN395. Moreover, we fixed the value ρ = 3000 for the test set cardinality and the set Θ = {500, 1000, 2000, 4000, 7000} of training-validation set cardinalities ϑ. For the two DFNs considered, we defined the sets of distribution parameters Σ, with |Σ| = 6 for DFN158 and |Σ| = 3 for DFN395. In total, for the following analyses, we trained 180 NNs for DFN158 (30 for each (α, β) case) and 90 NNs for DFN395 (15 for each (α, β) case); the reason for the smaller set Σ and, therefore, the smaller number of trainings for DFN395 is the more expensive DFN simulations (with respect to the ones of DFN158) that are needed for the creation of the dataset D_σ (see step 5 of the method in Section 2.3). The results found for the two DFNs are in very good agreement.
The analysis was performed for different combinations of the parameters α and β in order to show that the results found are general for the family of NNs. After these performance analyses, in Section 3.3, we describe the rules for the best choice of ϑ value given a σ value.

DFN158
Given the 180 trained NNs with respect to the datasets T_σ^ϑ and V_σ^ϑ of DFN158, we analyzed the set of points E (see (16)) for any fixed combination (α, β) ∈ {1, 2, 3} × {1, 2}; this set of points is described in Table 1 and illustrated in Figure 5.

For the average errors E_σ^ϑ, we observed the following behavior characteristics:
1. The general trend of E_σ^ϑ decreases with respect to ϑ and increases with respect to σ. Indeed, higher values of ϑ provide more data for better training the NN, whereas higher values of σ mean a larger variance of the input data and, therefore, a more difficult target function F to be learned.
2. Keeping the value of σ fixed, we observed that, in the logarithmic scale, the values of E_σ^ϑ are inversely proportional to ϑ (see Figure 6, left).
3. Keeping the value of ϑ fixed, we observed that, in the logarithmic scale, the values of E_σ^ϑ increase with respect to σ (see Figure 6, right), with an almost quadratic behavior with respect to σ.
The numerical results and these observations suggest that the performances of an NN for flux regression are characterized by well-defined hidden rules. Therefore, as proposed in step 11 of the method, we sought a function E(σ, ϑ) such that

  E(σ, ϑ) ≈ E_σ^ϑ,    (17)

for each (σ, ϑ) ∈ Σ × Θ.
Taking into account the observations at items 2 and 3, we decided to look for E(σ, ϑ) among the set of exponential functions characterized by exponents inversely proportional to ϑ and proportional to σ with linear or quadratic behavior, i.e., functions with the following expressions: where c 1 , c 3 ∈ R and c 2 , c 4 ∈ R ≥0 are parameters of the functions. Through a least square error minimization process, we found the best-fitting coefficients for the functions (18) with respect to the data points E ϑ σ (see Table 2). Looking at the results, we see that the observation made at item 3 concerning the quadratic behavior of E ϑ σ with respect to σ is confirmed; indeed, the approximation error of g 1 is always smaller than the one of g 2 (with a nonzero coefficient c 4 ). Then, we have that a good function E(σ, ϑ) for the characterization of the average errors is where c 1 , . . . , c 4 are the fixed parameters obtained with the least square minimization. We conclude this section with a visual example (Figure 7) of the fitting quality of E(σ, ϑ) for the values E ϑ σ of the case (α = 2, β = 2).
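This fit can be sketched in a few lines of code. Assuming the quadratic exponent form exp(c 1 + c 2 /ϑ + c 3 σ + c 4 σ 2 ) for g 1 , its logarithm is linear in the features (1, 1/ϑ, σ, σ 2 ), so an ordinary linear least-squares solve recovers c 1 , . . . , c 4 . The sketch below uses synthetic data points and hypothetical coefficient values (not the fitted values of Table 2):

```python
import numpy as np

# Synthetic average-error points generated from a known model, only to
# illustrate the fit; real data points would come from the trained NNs.
rng = np.random.default_rng(0)
sigmas = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
thetas = np.array([500, 1000, 2000, 4000, 7000])
c_true = np.array([-5.0, 400.0, 1.0, 1.5])   # hypothetical c1..c4

S, T = np.meshgrid(sigmas, thetas, indexing="ij")
s, t = S.ravel(), T.ravel()
logE = c_true[0] + c_true[1] / t + c_true[2] * s + c_true[3] * s**2
logE = logE + rng.normal(scale=1e-3, size=logE.shape)  # measurement noise

# log g1(sigma, theta) = c1 + c2/theta + c3*sigma + c4*sigma^2 is linear
# in the features below, so ordinary least squares recovers c1..c4.
A = np.column_stack([np.ones_like(s), 1.0 / t, s, s**2])
c_fit, *_ = np.linalg.lstsq(A, logE, rcond=None)
```

Fitting in log space is a design choice: it turns the nonlinear exponential fit into a linear solve; a direct nonlinear fit of the untransformed errors (e.g., with scipy.optimize.curve_fit) weights large errors more heavily and may yield slightly different coefficients.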

DFN395
Given the 90 trained NNs with respect to the datasets T ϑ σ and V ϑ σ of DFN395, we analyzed the set of points E for each fixed combination (α, β) ∈ {1, 2, 3} × {1, 2}. This set of points is described in Table 3 and illustrated in Figure 8. Looking at the results in Table 4 (MSE and coefficients of the least square minimization with respect to E ϑ σ ), it is very interesting to observe that the average errors E ϑ σ are characterized by the same behaviors observed for DFN158 and, therefore, that they can be described by a function E(σ, ϑ) with the same expression deduced for DFN158. We remark that the values of E ϑ σ increase faster with respect to σ than those in Section 3.1.

Error Characterization with Training Data
In Proposition 1 of this section, assuming that the average error of an NN, once trained, is characterized by the function E(σ, ϑ) described in (19), we identify the minimum value of ϑ (i.e., the minimum number of training data) required to obtain an average error smaller than an arbitrary quantity ε > 0 for each fixed σ ∈ R ≥0 ; in brief, for each fixed σ ∈ R ≥0 , the proposition tells us the minimum ϑ ∈ N such that E(σ, ϑ) ≤ ε.
We conclude this introduction to Proposition 1 with a few remarks on its assumptions on the coefficients c 1 , . . . , c 4 . By construction, it holds that c 2 , c 4 ∈ R ≥0 and c 1 , c 3 ∈ R; however, looking at the coefficients in Tables 2 and 4, we observe that c 1 is always negative, c 2 is always positive, and c 3 is always nonnegative; then, in the proposition, we assume c 1 < 0, c 2 > 0, and c 3 ≥ 0.

Proposition 1. Let E(σ, ϑ) be a function defined as in (19), such that c 1 < 0, c 2 > 0, and c 3 , c 4 ≥ 0. Then, for each ε > 0 and σ ∈ R ≥0 , the set of natural solutions Θ * ⊆ N of the inequality E(σ, ϑ) ≤ ε is characterized by the following:

1. Θ * is nonempty if and only if ε > ε σ := e^(C σ ), where C σ := c 1 + c 3 σ + c 4 σ 2 ;

2. if ε > ε σ , then Θ * = {ϑ ∈ N : ϑ ≥ c 2 /(ln ε − C σ )}.
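The characterization in Proposition 1 follows from elementary manipulations; as a sketch, assuming E(σ, ϑ) = exp(c 1 + c 2 /ϑ + c 3 σ + c 4 σ 2 ) as in (19):

```latex
E(\sigma,\vartheta)\le\varepsilon
\;\Longleftrightarrow\;
c_1+\frac{c_2}{\vartheta}+c_3\sigma+c_4\sigma^2\le\ln\varepsilon
\;\Longleftrightarrow\;
\frac{c_2}{\vartheta}\le\ln\varepsilon-C_\sigma,
\qquad C_\sigma:=c_1+c_3\sigma+c_4\sigma^2 .
```

Since c 2 > 0 and ϑ > 0, the last inequality admits solutions only when ln ε − C σ > 0, i.e., when ε > e^(C σ ) = ε σ , in which case it is equivalent to ϑ ≥ c 2 /(ln ε − C σ ).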
The threshold value ε σ = e^(C σ ) of Proposition 1 is actually the infimum of E(σ, ϑ) for fixed σ: since c 2 > 0, E(σ, ϑ) decreases monotonically toward e^(C σ ) as ϑ → ∞. Thanks to Proposition 1, we can define a rule-of-thumb "UQ rule" for users who need to perform UQ on a DFN with a number of fractures n in the order of magnitude of 158-395, generated by laws similar to those of Section 2.1.2, and who want to understand whether it is convenient to train an NN as a reduced model. This rule is based on the regular behavior characterizing the coefficients c 1 , . . . , c 4 of E(σ, ϑ) as the hyperparameters α and n vary, for each fixed β. Indeed, for each fixed β and i = 1, . . . , 4, we observe that the values of the coefficient c i with respect to (α, n) are well-approximated by a function c (β) i (α, n), described in Tables 5 and 6 and illustrated in Figures 9 and 10. The expression of the function c (β) i was chosen by looking at the positions of the points (α, n, c i ) in the space R 3 , for each i = 1, . . . , 4 and each β = 1, 2; future analyses, involving more DFNs (i.e., more cases for n), may help find better-fitting functions to describe the behavior of the coefficients c i . Given the functions c (β) i (α, n), for each fixed β = 1, 2, we can define a function E β (σ, ϑ, α, n) (see (23)) that returns estimates of the average errors E ϑ σ for any NN with architecture A α trained with respect to a number ϑ of simulations (and configuration β) to approximate the fluxes of a DFN with n fractures (see Section 2.1.2) and transmissivity variation characterized by σ. Then, the UQ rule exploits (23) and Proposition 1, and it is outlined by the following steps:

1. Let n ∈ {159, . . . , 394} ⊆ N be the number of fractures of a given DFN with fixed geometry generated with respect to the characterization of Section 2.1.2, and let σ be the parameter characterizing the standard deviation of the transmissivity distribution (see (1));

2. Choose an architecture A α , a training configuration β, and a target average error ε greater than the infimum value e^(C (β) σ (α,n)) obtained from the coefficients c (β) i (α, n);

3. Apply Proposition 1 with the coefficients c (β) i (α, n) to compute the minimum number ϑ (α,β) ε of training simulations required to reach the target error ε (see (24)).
The reliability of the values ϑ (α,β) ε depends on the reliability of the functions c (β) i (α, n) representing the coefficients c i . Therefore, we conclude this section by testing the efficiency of the UQ rule and, consequently, the reliability of the expressions chosen in this work for these functions. We validate and test the UQ rule with respect to DFN202, a DFN with n = 202 fractures (m = 14 outflow fractures) and transmissivity distribution characterized by σ = 1/3. We train an NN A α with configuration β ∈ {1, 2} on a number of simulations ϑ act equal to (24), rounded up to the nearest multiple of five, for each ε ∈ {0.01, 0.02, 0.03} and each α ∈ {1, 2, 3}. For the case (α, β) = (3, 1), we do not use the value ε = 0.01 but the value ε = 0.011, because 0.01 is too close to the infimum error value e^(C (β) σ (α,n)) = 0.0099; indeed, in this case, ϑ (α,β) ε is approximately equal to 20,000, and since we do not have enough simulations available to test ε = 0.01, we adopt ε = 0.011 instead. The average errors obtained for the test on DFN202 are reported in Table 7. For each (α, β), we report the minimum error value e^(C (β) σ (α,n)), the chosen target error ε > e^(C (β) σ (α,n)), the estimated minimum number of simulations ϑ (α,β) ε , the number of simulations ϑ act ∼ ϑ (α,β) ε performed for the training of the NN, and the final average error E ϑ σ returned by the trained NN on a test set P σ (with |P σ | = 3000). In all cases, with ϑ act ∼ ϑ (α,β) ε , the error E ϑ σ is very close to the target error ε.
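As an illustration, the ϑ-estimation step of the UQ rule can be implemented directly from Proposition 1. The function below assumes the error model E(σ, ϑ) = exp(c 1 + c 2 /ϑ + c 3 σ + c 4 σ 2 ) of (19); the coefficient values in the example are hypothetical placeholders, not the fitted values of Tables 5 and 6:

```python
import math

def min_training_size(c1, c2, c3, c4, sigma, eps):
    """Minimum integer theta with E(sigma, theta) <= eps (Proposition 1).

    E(sigma, theta) = exp(c1 + c2/theta + c3*sigma + c4*sigma**2);
    returns None when eps is at or below the infimum e**C_sigma.
    """
    C_sigma = c1 + c3 * sigma + c4 * sigma**2
    if eps <= math.exp(C_sigma):
        return None                      # no finite theta can reach eps
    return math.ceil(c2 / (math.log(eps) - C_sigma))

def round_up_to_multiple(x, m=5):
    """Round up to the nearest multiple of m, as done for theta_act."""
    return m * math.ceil(x / m)

# Hypothetical coefficients (illustrative only, not taken from the tables):
c1, c2, c3, c4 = -5.0, 400.0, 1.0, 1.5
theta = min_training_size(c1, c2, c3, c4, sigma=1.0 / 3.0, eps=0.02)
theta_act = round_up_to_multiple(theta)
```

For a target eps at or below the infimum e^(C σ ), no finite training set suffices, mirroring the (α, β) = (3, 1) case above, where the target 0.01 had to be relaxed to 0.011.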

Discussion
Some examples concerning the use of deep learning models to speed up UQ analysis can be found in [33,34]. The use of DNNs as surrogate models for UQ is still a novel approach that requires thorough investigation but is very promising. To the best of the authors' knowledge, other than [35][36][37], there are no works in the literature that train DNNs to perform flux regression tasks on DFNs and, in particular, that use these NNs in the context of UQ as in [35]. While the results illustrated in [35] are very promising, the ones presented in Section 3 of this work concern the use of NN reduced models as a practical possibility in the UQ framework for flow analyses of a subsurface network of fractures.
Let us assume that we deal with a natural fractured basin that can be described by a number of principal fractures in the order of {158, . . . , 395} and probability distributions for fractures and hydrogeological properties as in Section 2.1.2, with a fixed value σ ∈ [0.2, 0.7] ⊂ R. The stochastic flow analysis can be very relevant for geothermal energy exploitation and for enhanced oil and gas exploitation. Flux investigations can also be relevant in risk assessment for geological storage of nuclear waste. The approach could be extended to different situations, for example, to provide a statistical analysis of the effects of different fracturing approaches [46][47][48].
Uncertainty in fractures and hydrogeological properties requires the generation of an ensemble of DFNs describing the principal flow properties of the basin, and consequently, a UQ analysis of the flow properties is required. The results presented in Section 3.3 can be useful for deciding if the training of a DNN is convenient with respect to a Monte Carlo approach or the use of a different surrogate flow model. Thanks to the results in Section 3.3, we have the possibility to fix an approximation tolerance ε > 0 such that an NN trained on ∼ ϑ ε simulations fits the target tolerance.
Let us provide an example with DFN202, the validation DFN of Section 3.3 (σ = 1/3). During a UQ analysis for this DFN, a standard approach may need thousands of simulations to obtain good estimations of the mean value and the standard deviation of the flux exiting from the DFN. Nevertheless, the UQ rule tells us that we can train an NN with approximately 2% or 1% average error with less than 300 simulations or approximately 1000 simulations, respectively (see Table 7, (α, β) = (1, 1) case). Then, once trained, an NN can return virtually unlimited reliable predictions (i.e., approximations) of the DFN exiting fluxes, varying the fracture transmissivities, in the order of seconds; therefore, we can estimate the moments of the exiting flux using the NN predictions at the total cost of only the ∼ ϑ ε DFN simulations used to train the NN. If we repeat the procedure for each DFN geometry generated for the study, the advantages are significant.
A possible drawback of our method is that a UQ rule must be defined for the family of problems and NN architectures considered. Indeed, the UQ rule defined in Section 3.3 is tailored to the multi-task architecture described in Section 2.2.3, applied to the family of DFNs defined by the probability distributions in Section 2.1.2. Moreover, the UQ rule of this work can be considered reliable at most for DFNs with a few hundred fractures. The analysis performed here can be extended to larger DFNs and can provide useful information for wider applications.
The approach presented here is not immediately extensible to the case of DFNs with stochastic geometry (see [49]) due to the continuous change in inflow and outflow fractures. Nevertheless, a similar approach could be applied to the analysis of flows through the DFN occurring between a fixed set of wells. In that case, the NN can provide the flow through the fixed wells while varying the DFN geometry and the hydraulic properties of the fractures; we expect the number of required training simulations to increase, but the proposed approach could still provide information correlating the target error tolerance with the variance in the stochastic distributions and the number of fractures.

Conclusions
With this work, we proposed an analysis for the characterization of a family of DNNs with multi-task architecture A α trained to predict the exiting fluxes of a DFN given the fracture transmissivities. The novelty of this analysis consists in characterizing these NNs, searching for rules that describe their performances as the available training data (ϑ) and the standard deviation of the inputs (σ) vary. The results of our study show interesting common behaviors for all the trained NNs, providing a characterization of the average errors with the functions E(σ, ϑ) and E β (σ, ϑ, α, n) (see (19) and (23)). This result is interesting, since it shows that common characterizing formulas for NN performances exist, despite the stochastic nature of the NN training processes; thanks to these regularities, we are able to define a "UQ rule" that returns an estimate of the minimum number of simulations required for training an NN with an average error less than or equal to an arbitrary value ε > e^(C σ ).
This estimate can be fruitfully exploited in real-world problems. Indeed, in the framework of UQ, it suggests whether it is convenient to train an NN as a reduced model, so that a user can choose the best strategy between the use of an NN and direct simulations. In particular, the estimate returned by the UQ rule can be exploited in all "real-world" applications in which the analysis of flow through a DFN with stochastic transmissivities is required. The fields of interest include oil and gas extraction, where flows through a fractured medium occurring between a fixed set of wells need to be analyzed and the possible effects of phenomena that can impact the fracture transmissivities (for example, clogging) should be foreseen. Similar needs could occur in designing geothermal sites, whose performances strongly depend on the flow properties. Other application examples could be flow analysis for geological risk assessments of geological carbon dioxide or nuclear waste storage, or water-pollution prevention close to other pollutant storage sites. The usage of NNs as reduced models for DFN flow simulations, optimizing the number of required numerical simulations for training through the UQ rule, can save precious time when computing an estimate of the risks and, therefore, when deciding how to intervene to prevent or manage a calamity.
In general, we believe that many approaches for underground flow analysis through DFNs can be endowed with a tailored version of the method proposed in this paper; then, the method can speed up the simulation process, which is often slow and expensive, granting considerable advantages in many real-world geophysical applications.

Abbreviations
The following abbreviations and nomenclatures are used in this manuscript: