Learning Algorithms for Coarsening Uncertainty Space and Applications to Multiscale Simulations

In this paper, we investigate and design multiscale simulations for stochastic multiscale PDEs. As for the space, we consider a coarse grid and a known multiscale method, the Generalized Multiscale Finite Element Method (GMsFEM). In order to obtain a small dimensional representation of the solution in each coarse block, the uncertainty space needs to be partitioned (coarsened). This coarsening collects realizations that provide similar multiscale features as outlined in GMsFEM (or other method of choice). This step is known to be computationally demanding as it requires many local solves and clustering based on them. In this paper, we take a different approach and learn coarsening the uncertainty space. Our methods use deep learning techniques in identifying clusters(coarsening) in the uncertainty space. We use convolutional neural networks combined with some techniques in adversary neural networks. We define appropriate loss functions in the proposed neural networks, where the loss function is composed of several parts that includes terms related to clusters and reconstruction of basis functions. We present numerical results for channelized permeability fields in the examples of flows in porous media.


Introduction
Many problems are multiscale with uncertainties.Examples include problems in porous media, material sciences, biological sciences, and so on.For example, in porous media applications, engineers can obtain fine-scale data about pore geometries or subsurface properties at very fine resolutions.These data are obtained in some spatial locations and then generalized to the entire reservoir domain.As a result, one uses geostatistical or other statistical tools to populate the media properties in space.The resulting porous media properties are stochastic and one needs to deal with many porous media realizations, where each realization is multiscale and varies at very fine scales.
Simulating each realization can be computationally expensive because of the media's multiscale nature.Our objective is to simulate many of these realizations.To address the issues associated with spatial and temporal scales, many multiscale methods have been developed ([3, 11, 12, 14, 15, 18, 19, 9, 8, 7, 10]).These methods perform simulations on the coarse grid by developing reduced-order models.However, developing reduced-order models requires local computations, which can be expensive when one deals with many realizations.For this reason, some type of coarsening of the uncertainty space is needed ( [13]).In this paper, we consider some novel approaches for developing coarsening of uncertainty space as discussed below.
To coarsen the uncertainty space, clustering algorithms are often used; but a proper distance function should be designed in order to make the clusters have physical sense and achieve a reduction in the uncertainty space.The paper ( [13]) proposed a method that uses the distance between local solutions.
The motivation is that the local problems with random boundary conditions can represent the main models with all boundary conditions.Due to a high dimension of the uncertainty space, the authors in ( [13]) proposed to compute the local solutions of only several realizations and then use the Karhunen-Loeve expansion( [28]) to approximate the solutions of all the other realizations.The distance function is then defined to be the distance between solutions and the standard K-means ( [33]) algorithm is used to cluster the uncertainty space.
The issue with this method is computing the local solutions in the local neighborhoods.It is computationally expensive to compute the local solutions; although the KL expansion can save time to approximate the solutions of other realizations, one still needs to decide how many selected realizations we need to represent all the other solutions.In this paper, we propose the use of deep learning methodology and avoid explicit clustering as in earlier works.We remark that the development of deep learning techniques for multiscale simulations are recently reported in [44,43,41,6,42].
In this work, to coarsen the uncertainty space, we propose a deep learning algorithm which will learn the clusters for each local neighborhood.Due the nature of the permeability fields, we can use the transfer learning which uses the parameters of one local neighborhood to initialize the learning of all the other neighborhoods.This saves significantly computational time.
The auto encoder structure [20] has been widely used in improving the K-mean clustering algorithm ( [5,45,46]).The idea is to use the encoder to extract features and reduce the dimension; the encoding process can also be taken as a kernel method [2] which maps the data to a space which is easier to be separated.The decoder is used to upsample the latent space (reduced dimension feature space) back to the input space.The clustering algorithm is then used to cluster the latent space, which will save time due to the low dimension of the latent space and also preserve the accuracy due to the features extracted by the encoder.
Traditionally, the learning process is only involved in reconstructing the input space.Such kind of methods ignore the features extracted by latent space; so, it is not clear if the latent space is good enough to represent the input space and is easily clustered by the K-means method.In [46], the authors proposed a new loss which includes the reconstruction loss meanwhile the loss results from the clustering.The authors claimed that the new loss improves the clustering results.
We will apply the auto encoder structure and the multiple loss function; however, we will design the auto encoder as a generative network, i.e., the input and output space are different.More precisely, the input is the uncertain space (permeability fields) and the output will be the multiscale functions co-responding to the uncertain space.Intuitively, we want to use the multiscale basis to supervise the learning of the clusters so that the clusters will inherit the property of the solution.The motivation is the multiscale basis can somehow represent the real solutions and permeability fields; hence, the latent space is no longer good for clustering the input space but will be suitable for representing the multiscale basis function space.
To define the reconstructing loss, the common idea is the mean square error (MSE); but many works ( [25,31,30,27]) have shown that the MSE tends to produce the average effect.In fact, in the area of image super-resolution ( [25,31,30,27,16,29,48,49,39,32,40]) and other low level computer vision tasks, the generated images are usually over-smooth if trained using MSE.The theory is the MSE will capture the low frequency features like the background which is relatively steady; but for images with high contrast, the MSE will usually try to blur the images and the resulting images will lose the colorfulness and become less vivid ( [25]).Our problem has multiscale nature and we want to capture the dominant modes and multiscale features, hence a single MSE is clearly not enough.
Following the idea from ( [27,31]), we consider adding an adversary net ( [21]).The motivation is the fact that different layers of fully convolutional network extract different features ( [23,27,47]).Deep fully convolutional neural networks (FCN) ( [34,37,38,1,35,24]) have demonstrated its power in almost all computer vision tasks.Convolution operation is a local operation and the network with full convolutions are independent with the input size.People now are clear about the functioning of the different layers of the FCN.In computer vision task, the lower layers (layers near input) tend to generate sharing features of all objects like edges and curves while the higher layers (near output) are more object oriented.If we train the network using the loss from the lower layers, the texture and details are persevered, while the higher layers will keep the general spatial structure.This motivates us using the losses from different layers of the fully convolutional layers.Multiple layers will give us a multilevel capture of the basis features and hence measure the basis in a more complete way.To implement the idea, we will pretrain an adversary net; and input the multiscale basis of the generative net and the real basis.The losses then come from some selected layers of the adversary net.Although it is still not clear the speciality of each layer, if we consider the multiscale physical problem, the experiments show that the accuracy is improved and, amazingly, the training becomes easier when compared to the MSE of the basis directly.The uncertain space coarsening (cluster) is performed using the deep learning idea described above.But due to the space dimension, we will perform the clustering algorithm locally in space; that is, we first need a spatial coarsening.Due to the multiscale natural of the problem, this motivates us using the generalized multiscale finite element methods (GMsFEM) which derives the multiscale basis of a coarse neighborhood by solving the local problem.GMeFEM was first proposed in ( [17]) and further studied in ([3, 11, 12, 14, 15, 18, 19, 9]).This method is a generalization of the multiscale finite element method ( [22,26]).The work starts from constructing the snapshot space for each local neighborhood.The snapshot space is constructed by solving local problems and several methods including harmonic extension, random boundary condition [4] have been proposed.Once we have the snapshot space, the offline space which will be used as computing the solution are constructed by using spectral decomposition.
The rest of the work is organized as follow: in Section 2, we consider the problem setup and introduce both uncertain space and spatial coarsening.In Section 3, we introduce the structure of the network and the training algorithm.In Section 4, we will present the numerical results.The paper ends with conclusions.

Problem Settings
In this section, we will present some basic ideas involving the use of the generalized multiscale finite element method (GMsFEM) for parameter-dependent problems.Let D be a bounded domain in R 2 and Ω be the parameter space in R N .We consider the following parameter-dependent elliptic problem: where κ(x, s) is a heterogeneous coefficient depending on both the spatial variable x and the parameter s, and f ∈ L 2 (D) is a given source.We remark that the differential operators in (1) are defined with respect to the spatial variable x.This is the case for the rest of the paper.

2.1
The Coarsening of the Parameter Space.The Main Idea.
The parameter space Ω is assumed to be of very high dimension (i.e.large N ) and consists of very large number of realizations.For a given realization, the idea is to find its representation in the coarse space and use the coarse space to perform the computation.We will use the deep cluster learning algorithm to perform the coarsening.Due to the heterogeneous properties of the proposed problem, fine mesh is used; this will bring difficulties in coarsening the parameter space and in computation of the solution.We hence perform the parameter coarsening locally in the space D, i.e., we also coarsen the spatial domain.
To coarsen the spatial domain, we use coarse grids and consider the GMsFEM.
In Figure 1, we present an illustration of the proposed coarsening technique.On the left figure, the coarse grid blocks in the space are shown.Each coarse grid has a different clusters in the uncertainty space Ω, which corresponds to the coarsening of the uncertainty space.The main objective in multiscale methods is efficiently finding the clustering of the uncertainty space, which is our main goal.

Space Coarsening -Generalized Multiscale Finite Element Method
It is computationally expensive to capture heterogeneous properties using very fine mesh.For this reason, we use GMsFEM to coarsen the spatial representation of the solution.The coarsening of the parameter space will be performed in each local spatial neighborhood.We will achieve this goal by the GMsFEM, which will briefly be discussed.Consider the second order elliptic equations Lu = f in D with proper boundary conditions; denote the the elliptic operator as: Let the spatial domain D be partitioned by a coarse grid T H ; this does not resolve the multiscale features.Let us denote K as one cell in T H and refine K to obtain the fine grid partition T h (blue box in Figure 2).We assume the fine grid is a conforming refinement of the coarse grid.See Figure 2 for details.For the i-th coarse grid node, let ω i be the set of all coarse elements having the vertex i (green region in Figure 2).We will solve local problem in each coarse neighborhood to obtain set of multiscale basis functions {φ ωi i } and seek solution in the form: where φ ωi j is the offline basis function in the i-th coarse neighborhood ω i and j denotes the j-th basis function.Before we construct the offline basis, we first need to derive the snapshot basis.

Snapshot Space
There are several methods to construct the snapshot space; we will use the harmonic extension of the fine grid functions defined on the boundary of ω i .Let us denote δ h l (x) as fine grid delta function, which is defined as δ h l (x k ) = δ l,k for x k ∈ J h (ω i ) where J h (ω i ) denotes the boundary nodes of ω i .The snapshot function ψ ωi l is then calculated by solving local problem in ω i : subject to the boundary condition ψ ωi l = δ h l (x).The snapshot space V ωi snap is then constructed as the span of all snapshot functions.

Offline Spaces
The offline space V ωi of f is derived from the snapshot space and is used for computing the solution of the problem.We need to solve for a spetral problem and this can be summarized as finding λ and v ∈ V ωi snap such that: where a ωi is symmetric non-negative definite bilinear form and s ωi is symmetric positive definite bilinear form.By convergence analysis, they are given by In the above definition of s ωi , the function κ = κ |∇χ j | 2 where {χ j } is a set of partition of unity functions corresponding to the coarse grid partition of the domain D and the summation is taken over all the functions in this set.The offline space is then constructed by choosing the smallest L i eigenvalues and we can form the space by the linear combination of snapshot basis using corresponding eigenvectors: where Ψ ωi kj is the jth element of kth eigenvector and L i is the number of snapshot basis.V of f is then defined as the collection of all local offline basis functions.Finally we are trying to find where a(u, v) = D κ∇u • ∇v.For more details, we refer the readers to the references [18,19,9].

The Idea of the Proposed Method
We present the general methodology in this section.The target is to save the time in computing the GMsFEM basis φ ωi k for all ω i and for all uncertain space parameters.We propose the clustering algorithm to coarsen the uncertain space in each local neighborhood.The key to the success of the clustering is that: the cluster should inherit the property of the solution, that is, the local heterogeneous fields κ(x, s) clustered into the same group should have similar solution properties.When the cluster is learnt by the some learning algorithm, the only computation involved is to fit the local neighborhood of the given testing heterogeneous field into some cluster.This is a feed forward process including several convolution operations and matrix multiplications and compared to the direct computing, we save a lot of time in computing the spectral problem (6) and the inverse of a matrix (10).The detailed process is illustrated in the following chart (Figure 3): (Training) For a given input local neighborhood ω j , we train the cluster (which will be detailed in next section) of the parameter space Ω and get the clusters S j 1 , ..., S j n , where n is the number of clusters and is uniform for all j.Please note that we may have different cluster assignments in different local neighborhoods.

(Training)
For each local neighborhood ω j and cluster S j i , define the average κij and compute generalized multiscale basis for κij .

(Testing)
Given a new κ(x, s) and for each local neighborhood ω j , fit κ(x, s) into a S j i by the trained network (step 1) and use the pre-computed GMsFEM basis (step 2) to find the solution.
It should be noted that we perform clustering using the heterogeneous fields; however, the cluster should inherit the property of the solution corresponding to the heterogeneous fields.This makes the clustering challenging.The performance of the standard K-Means algorithm relies on the initialization and the distance metric.We may initialize the algorithm based on the clustering of the heterogeneous fields but we need to design a good metric.In the next section, we are going to introduce a learning algorithm which uses an auto-encoder structure and multiple losses to achieve the required clustering task.

Deep Learning
The network is consisted of two sub networks.The first one is targeted to performing the clustering and second one, which is the adversary net, will serve as the reconstruction of loss function.

Clustering Net
The cluster net is aimed for clustering the heterogeneous fields κ(x, s); but the resulting clusters should inherit the properties of the solution corresponding to the κ(x, s), i.e., the heterogeneous fields grouped in the same cluster should have similar corresponding solution properties.This similarity will be measured by the adversary net which will be introduced in section (3.3).We hence design the network demonstrated in Figure 4.The input X ∈ R m,d ,where m is the number of samples and d is the dimension of one local heterogeneous field, of the network is the local heterogeneous fields which are parametrized by the random variable s ∈ Ω.The output of the network is the multiscale basis (first GMsFEM basis) which somehow represents the solution corresponding to the coefficient κ(x, s).This is a generative network which has an auto encoder structure.The dimension reduction function F (X) can be interpreted as some kind of kernel method which maps the input data to a new space which is easier to be separated.This can also be interpreted as the learning of a good metric function for the later performed K-mean clustering.We will perform K-means clustering algorithm in latent space F (X). G(•) will then transfer the latent space data to the space of multiscale basis function.This process can be taken as a generative process and we reconstruct the basis from the extracted features.The detailed algorithm is as follow (see Figure 5 for an illustration): 3. Cluster the latent space by K-Means algorithm (reduced dimension space, which is a middle layer of the cluster network); the latent space data are computed using the previous optimized parameters; the assignment will be denoted as A.
4. Basis functions whose corresponding inputs are in the same cluster (basing on assignment A) will be grouped together.No training or fitting-in involved in this step.
5. Repeat step 2 to step 4 until the stopping criteria is met.

Loss Functions
Loss function is the key to the deep learning.Our loss function is consisted of cluster loss and the reconstruction loss.
1. Clustering loss C(θ F , θ G ): this is the mean standard deviation of all clusters of the learnt basis and θ is the parameters we need to optimize.It should be noted that the loss here is computed using the learnt basis instead of the input of the network.This loss controls the clustering process, i.e., the smaller the loss, the better the clustering in the sense of clustering the multiscale basis.Let us denote κ ij as jth realization in ith cluster; G(F (κ ij )) ∈ R d will then be jth learnt basis in cluster i and let θ G and θ F be the parameters associated with G and F , the loss is then defined as follow, where |A| is the number of clusters which is a hyper parameter and A i denotes the number of elements in cluster i; φi ∈ R d is the mean of cluster i.This loss clearly serves the purpose of clustering the solution instead of the input heterogeneous fields; however, in order to guarantee the learnt basis are closed to the pre-computed multiscale basis, we need to define the reconstruction loss which measures the difference between the learnt basis and the pre-computed basis.
2. Reconstruction loss R(θ F , θ G ): this is the mean square error of multiscale basis Y ∈ R m,d , where m is the number of samples.This loss controls the construction process, i.e., if the loss is small, the learnt basis are close to the real multiscale basis.This loss will supervise the learning of the cluster.It is defined as follow: where G(F (κ i )) ∈ R d and φ i ∈ R d are learnt and pre-computed multiscale basis of ith sample κ i separately.
The entire loss function is then defined as , where λ 1 , λ 2 are predefined weights.We are going to solve the following optimization problem: for the required training process.

Adversary Network Severing as an Additional Loss
We have introduced the reconstruction loss which measures the similarity between the learnt basis and the pre-computed basis in the previous section.It is the Mean square error (MSE) of the learnt and pre-computed basis.MSE is a smooth loss and easy to train but there is a well known fact about MSE that this loss will blur the image.In the area of image super-resolution and other low level computer vision tasks, the loss is not friendly to inputs with high contrast and the resulting generated images are usually over smooth.Our problem has multiscale nature and is similar with the low level computer vision task, i.e., this is a generative task; hence blurring and over smooth should happen if the model is trained by MSE.To define a great reconstruction loss is important.Motivated by some works about the successful application of deep fully convolutional network (FCN) in computer vision, we design a perceptual loss to measure the error.It is now clear that the lower layers in the FCN usually will extract some general features shared by all objects like the horizontal (vertical) curves, while the higher layers are usually more objects oriented.This gives people the insight to train the network using different layers.Johnson then proposed the perceptual loss [27] which is the combination of the MSE of selected layers of the VGG model [36].The authors claim in their paper that the early layers tends to produce images that are visually indistinguishable from the input; however if reconstruct from higher layers, image content and overall spatial structure are preserved but color, texture, and exact shape are not.
We will adopt the perceptual loss idea and design an adversary network to compute an additional reconstruction loss.The network structure can be seen in Figure 6.The adversary net is fully convolutional with input and output both pre-computed multiscale basis.The network has an auto encoder structure and is pre-trained; i.e., we are going to solve the following minimization problem: where φ i is the multiscale basis and f is the adversary net associated with trainable parameter θ A .Denote f j (•) as the output of layer j of the adversary network.The additional reconstruction loss is then redefined as: where I is the index set which contains some layers of the adversary net.The complete optimization problem can be now formulated as:

Numerical Experiments
In this section, we will demonstrate a series of experiments.We are going to apply our method on problems with high contrast including moving background and moving channels.Let us first introduce the high contrast field.

High Contrast Heterogeneous Fields with Moving Channels
We consider solving (1)-( 2) for a heterogeneous field with moving channels and changing background.
Let us denote the heterogeneous field as κ(x), where x ∈ [0, 1] 2 , then κ(x) = 1000 if x is in some channels which will be illustrated later and otherwise, κ(x) = e η•sin(7πx)sin(8πy)+sin(10πx)sin(12πy) , where η follows discrete uniform distribution in [0, 1].The channels are moving and we include cases of the intersection of two channels and formation and disappearance of the channels in the fields.In Figure 7, we demonstrate 20 heterogeneous fields.We train the network using 600 samples using the Adam gradient descent.We find that the cluster assignment of 600 realizations in uncertain space is stable(fixed) when the gradient descent epoch reaches a certain number, so we set the stopping criteria to be: the assignment does not change for 100 iteration epochs; and the maximum number of iteration epochs is set to be 1000.We also find that the coefficients in (16) can affect the training result.We set λ 1 = λ 2 = λ 3 = 1.
It should be noted that we train the network locally in each coarse neighborhood.The fine mesh element has size 1/100 and 5 fine elements are merged into one coarse element.

Results
We will present the numerical results of the proposed method in this section.We are going to show the cluster assignment experiment first, followed by two other experiments which demonstrate the error of the method.

Cluster Assignment in a Local Coarse Element
Before diving into the error analysis, we will show some of the cluster results in a local neighborhood.In this neighborhood, we manually created the cases such as: the extraction of a channel (longer) , the expansion of a channel(wider), the discontinuity of a channel, the diagonal channels, the intersection of channels an so on.In Figure 8, the number on top of each image is the cluster assignment ID number.We also demonstrate the clustering result in Figure 9 of another neighborhood which is around (25,45) in Figure 7. From the results in both Figure 8 and Figure 9, we observe that our proposed clustering algorithm based on deep learning is able create a good clustering of the parameter space.That is, heterogeneous coefficients with similar spatial structures are grouped in the same cluster.

Relation of Error and the Number of Clusters
In this section, we will demonstrate the error change when the hyperparamter -the number of clusters increases.Given a new realization κ(x, ŝ) where ŝ denotes the parameter and a fixed neighborhood, suppose the neighborhood of this realization will be fitted into cluster S i by the model trained.We compute κi = 1 |Si| |Si| j=1 κ ij where |S i | is the number of points in this cluster S i .The GMsFEM basis of this neighborhood can then be derived using κi .We finally construct the solution using the GMsFEM basis pre-computed in all neighborhoods.We define the l 2 relative error as : where u is the exact solution computed by finite element method with fine enough mesh and u H is the solution of the proposed method.We test the network on newly generated 300 samples and take the average of the errors.
In this experiment, we calculate the l 2 relative error with the number of clusters increases.The number of clusters ranges from 5 to 11; and for each case, we will train the model and compute the l 2 relative error.The result can be seen in Figure 10 and it can be observed from the picture that, the error is decreasing with the number of cluster increases.

Comparison of Cluster-based Method with Tradition Method
In the second experiments, we first compute the l 2 relative error (defined in (17) with u H denotes the GMsFEM solution) of traditional GMsFEM method with given κ(x, ŝ).This means that the construct multiscale basis functions using the particular realization κ(x, ŝ).We then compare this error with the cluster method proposed (11 clusters).The comparison can be seen in Figure 11.It can be seen that the difference is negligible when the number of clusters reaches 11.We can then benefit from the deep learning; i.e., the fitting of κ(x, ŝ) into a cluster is fast; and since we will use the pre-computed basis, we also save time on computing the GMsFEM basis.

Effect of the Adversary Net
The target of this task is not the learning of multiscale basis; the multiscale basis in this work is just a supervision of learning the cluster.However to demonstrate the effectiveness of the adversary network, we also test the the effect of the adversary net.There are many hyper-parameters like the number of clusters and coefficients of the loss function which can affect the result; so to reduce the influence from the clustering, we remove the clustering loss from the training, so this is a generative task which will generate the multiscale basis from the output of the first network in Figure 6.The loss function now can be defined as: where R and A are defined in (12) and ( 15) separately; and λ 1 and λ 1 are both set to be 1.We compute the relative error (17) first by using the learnt multiscale basis which is trained by (18); and second by using the multiscale basis trained without the adversary loss (15), i.e., min The l 2 relative error improves from 41.120439 to 36.760918 if we add one middle layer from the adversary net.
We also calculate the MSE difference of two learnt basis (by loss (18) and ( 19) separately) and real multiscale basis, i.e., we calculate B learnt basis − B real basis M SE , where B learnt basis refers to two basis trained with (18) and (19) separately and B real basis is the real multiscale basis formed using the input heterogeneous field.The MSE amazingly decreases from 0.9073400 to 0.748312 if we use basis trained with the adversary loss (18).This can show the benefit from the adversary net.

Conclusion
We propose a deep learning clustering technique within GMsFEM to solve flows in heterogeneous media.The main idea is to cluster the uncertainty space such that we can reduce the number of multiscale basis functions for each coarse block across the uncertainty space.We propose the adversary loss motivated by the perceptual loss in the computer vision task.We use convolutional neural networks combined with some techniques in adversary neural networks, where the loss function is composed of several parts that includes terms related to clusters and reconstruction of basis functions.We present numerical results for channelized permeability fields in the examples of flows in porous media.In future, we would like to study the relation between convolutional layers and quantities related to multiscale basis functions.

Figure 1 :
Figure 1: Illustration of coarsening of space and uncertainties.Different clusters for different coarse blocks.On the left plot, two coarse blocks are shown.On the right plot, clusters are illustrated.

Figure 3 :
Figure 3: Work Flow of the Proposed Method

Figure 6 :
Figure 6: The Complete Network

Figure 7 :
Figure 7: Heterogeneous fields, the yellow strips are the channels

Figure 8 :
Figure 8: Cluster results of 28 samples, images shown are heterogeneous fields, the number on top of each image is the cluster assignment ID number.

Figure 9 :
Figure 9: Cluster results of 20 samples, images shown are heterogeneous fields, the number on top of each image is the cluster assignment ID number.

Figure 10 :
Figure 10: L 2 error with the number of clusters changes, colors represent the number of GMsFEM basis