Efficient Dimensionality Reduction Methods in Reservoir History Matching

Abstract: Production forecasting is the basis for decision making in the oil and gas industry, and can be quite challenging, especially in terms of complex geological modeling of the subsurface. To help solve this problem, assisted history matching built on ensemble-based analysis, such as the ensemble smoother and the ensemble Kalman filter, is useful for estimating models that preserve geological realism and have predictive capabilities. These methods tend, however, to be computationally demanding, as they require a large ensemble size for stable convergence. In this paper, we propose a novel method of uncertainty quantification and reservoir model calibration with much-reduced computation time. This approach is based on a sequential combination of nonlinear dimensionality reduction techniques, t-distributed stochastic neighbor embedding or the Gaussian process latent variable model, and K-means clustering, along with the data assimilation method ensemble smoother with multiple data assimilation. The cluster analysis with t-distributed stochastic neighbor embedding and the Gaussian process latent variable model is used to reduce the number of initial geostatistical realizations and to select a set of optimal reservoir models that have production performance similar to the reference model. We then apply the ensemble smoother with multiple data assimilation to provide reliable assimilation results. Experimental results based on the Brugge field case data verify the efficiency of the proposed approach.


Introduction
Research scientists have worked for many years to develop viable methods to calibrate complex reservoir models. However, the uncertainty associated with reservoir models is highly significant, introducing considerable errors in the modeling process. There are several ways to quantify uncertainty in reservoirs. One is the conditioning of reservoir parameters to observed production data, a process referred to as an inverse problem or history matching (HM). The first step of HM is parameterization, namely to independently define and vary the model variables in a numerical reservoir simulation model: porosity, permeability, the density and permeability of fractures, the initial depths of the oil-water and gas-oil contacts, relative permeability curves, capillary pressure curves, fluid composition, aquifer strength and size, and fault transmissibility [1]. It is not practical to vary all of them, however, because of the large space of possible adjustments caused by the large number of grid blocks and variables; the number of varying parameters should therefore be kept as small as possible. To this end, reparameterization methods based on the pilot point method, the spline function method, the wavelet function method, Karhunen-Loeve reparameterization, and the discrete cosine transform have been used [2]. The second step is to select the production data, which must be sensitive to the parameters to be history matched. The sensitivity becomes more complex, however, for reservoirs with multiphase flow. In these cases, the cross-covariance of the production data with the model variables is used instead; its main advantage is that it is generally smoother and can show a more global relationship between data and variables, since it is a product of sensitivities and covariances.
The algorithms for HM are diverse. Evolutionary algorithms are often considered a standard approach; for example, by generating a new model as a combination of two Gaussian reservoir models, the gradual deformation algorithm reduces the HM problem to a one-dimensional minimization problem [2]. Sambridge (1999) [3] introduced a neighborhood algorithm in which the resampling of the parameters is guided by the information in an available ensemble. In addition, several other methods have been introduced to optimize reservoir models via particle swarm optimization [4], simulated annealing [5], and simultaneous perturbation stochastic approximation [6]. When solving the history matching problem, a key issue must be considered: uncertainty quantification. Uncertainty quantification requires strong knowledge of the reservoir characteristics, and uncertainty should be represented by a set of reservoir models (or realizations) rather than a single history-matched model. The Markov chain Monte Carlo (McMC) method [7], the randomized maximum likelihood method [8], the ensemble Kalman filter (EnKF) method [8][9][10], the ensemble smoother (ES) [11], and the ensemble smoother with multiple data assimilation (ES-MDA) [12,13] are useful methods for quantifying uncertainty. For all of these techniques, accuracy and speed are the two main concerns, given the non-unique solutions and the ill-posed nature of the inverse problem.
Many parametrization methods used in DR have already been introduced. For instance, Vo and Durlofsky [14] used principal component analysis (PCA) to reparametrize high-dimensional data into a low-dimensional space and then regenerated new realizations from the principal PCA parameters for data assimilation, while others have used singular value decomposition (SVD) [15] and kernel PCA (KPCA) [16]. Muzammil [18,19] also introduced PCA to select suitable models for EnKF. Tolstukhin et al. [20] demonstrated how data analytics can improve the efficiency of ensemble history matching by analyzing the statistics that link the static model ensemble and the dynamic model ensemble update. Satija et al. [21] proposed a method known as direct forecasting (DF), based on projecting the prior predictions into a low-dimensional canonical space that maximizes the correlation of the projected data, and on estimating the joint distribution of historical and forecasted data through linear Gaussian regression; they concluded that this method provided uncertainty estimates of the production forecast that agreed reasonably with rejection sampling. Park et al. [22] proposed an extended approach based on direct forecasting, in which the geological model parameters and the dynamic data are used simultaneously. Our approach in the current paper differs from the previous work of Kang et al. (2019) [19]. Dimensionality reduction techniques such as PCA and SVD, however, are linear approaches that may not accurately represent the relationship between the high-dimensional parameters and the latent variables of the reduced space, which likely leads to poor performance in model assimilation and prediction. In addition, the use of EnKF tends to be computationally prohibitive in certain circumstances and also generates spurious correlations, leading to loss of geological realism and underestimation of uncertainties (ensemble collapse).
In this work, we propose a novel scheme to reduce the number of ensemble members while preserving the prediction quality by combining ES-MDA with machine-learning DR techniques and cluster analysis. We demonstrate the efficiency of using the non-linear DR techniques t-distributed stochastic neighbor embedding (t-SNE) [11] and the Gaussian process latent variable model (GPLVM) [23,24], along with K-means clustering, to select effective reservoir models and save computational time without simulating and assimilating the entire initial ensemble. This study uses the Brugge field reservoir case to demonstrate that the new implementation is computationally faster and more robust than the standard procedure proposed in [12,13] and can provide appropriate posterior uncertainty quantification.
The paper is structured as follows. In the next section, we present the complete methodology, which we then test on the well-known Brugge field reservoir model. Several cases involving different reference models are also considered. Finally, some concluding remarks and possible directions for future work are provided.

Materials and Methods
The procedure applied in this study has four main stages:
1. Generate the ensemble of reservoir models and analyze whether the observed (reference) data can be predicted by the prior, so that the posterior distribution appertains to the prior range.
2. Reduce the ensemble dimension and construct a 2D space using t-SNE and GPLVM.
3. Apply K-means clustering to extract the set of reservoir models with the least production error relative to the reference model.
4. After extracting the models and selecting the most informative ones, run the HM process using ES-MDA; finally, compare the history-matching performance of the proposed workflow with that of the standard ES-MDA without dimensionality reduction techniques.
The general steps of the approach are shown in Figure 1, and the algorithms employed are described in more detail below.

Prior Sampling and Analysis
Due to the high dimensionality and nonlinearity of physics-based subsurface models, Monte Carlo simulation was used to sample the possible prior range of the model parameterization and the probability distribution of each geological parameter (e.g., the structural model, rock types, the petrophysical model, and the subsurface fluid distribution). Let m ∈ R^N denote the vector of uncertain static parameters of a reservoir model, with the dynamic data variables (e.g., oil production and water cuts) collected in a vector d. The nonlinear forward model is defined as

d = G(m),

where G is evaluated through a reservoir simulator. Applying it to the prior geological model realizations, m = {m_1, m_2, m_3, ..., m_N}, we obtained a set of N samples of dynamic data variables, d = {d_1, d_2, d_3, ..., d_N}. We refer to the vector of observation data as d_obs. Once the prior samples are generated, it is important to check that the observed data can be predicted by the prior model, so that the posterior distribution appertains to the prior range; otherwise, there is a risk that the prediction will be erroneous. If the prior model is falsified, which indicates inconsistency with the data, the prior distribution must be revised. To evaluate the quality of the prior model and its ability to predict the data, we propose a statistical procedure based on the robust Mahalanobis distance (RMD) [25,26], which handles high-dimensional data and different types of measurements, the main objective being to detect outliers and determine whether the prior model is falsified. The RMD for each data realization d_n (or for d_obs) was computed as

RMD(d_n) = \sqrt{ (d_n - \rho)^T \beta^{-1} (d_n - \rho) },

where \rho and \beta are the (robust) mean and covariance of the data d. Assuming the distribution of the data is multivariate Gaussian, [RMD(d_n)]^2 follows a chi-squared distribution \chi^2_d.
We set the 95th percentile of \chi^2_d as the tolerance threshold for the multivariate data points d_n. If RMD(d_obs) falls above this tolerance threshold, d_obs is regarded as an outlier and the prior model is falsified, as it has a very small probability of predicting the observations. It should also be noted that this method requires the data distribution to be Gaussian; if it is not, other outlier-detection techniques such as isolation forests [27], local outlier detection [28], and one-class support vector machines [29] are highly recommended.
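As an illustration, the falsification test above can be sketched in a few lines of Python. The robust mean and covariance (\rho and \beta) are estimated here with scikit-learn's minimum covariance determinant estimator; the function name and the synthetic stand-in data are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def prior_falsified(d_prior, d_obs, alpha=0.95, seed=0):
    """Flag d_obs as an outlier w.r.t. the prior data ensemble via squared RMD.

    d_prior: (N, p) simulated data vectors, one row per prior realization.
    d_obs:   (p,) vector of observed data.
    """
    # Robust estimates of the data mean/covariance (rho and beta in the text).
    mcd = MinCovDet(random_state=seed).fit(d_prior)
    rmd2_obs = mcd.mahalanobis(d_obs.reshape(1, -1))[0]  # squared RMD of d_obs
    # Tolerance threshold: 95th percentile of chi-squared with p deg. of freedom.
    threshold = chi2.ppf(alpha, df=d_prior.shape[1])
    return rmd2_obs > threshold

rng = np.random.default_rng(0)
d_prior = rng.normal(size=(200, 3))                 # synthetic prior data
print(prior_falsified(d_prior, np.zeros(3)))        # observation inside prior
print(prior_falsified(d_prior, np.full(3, 10.0)))   # observation far outside
```

If the Gaussian assumption is doubtful, the same interface could dispatch to one of the alternative detectors mentioned above (e.g., an isolation forest) instead of the chi-squared threshold.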

Dimensionality Reduction
A single reservoir model is represented by numerous grid blocks, each with its own reservoir properties, such as permeability, porosity, and net-to-gross. Accordingly, we construct a vector containing the reservoir properties of all grid blocks and stack the multiple realizations, which account for geological uncertainty, into an ensemble matrix X ∈ R^{N×m}. Since typical ensembles are formed by hundreds of realizations, we are faced with a high-dimensional problem. Geological realizations with similar geological parameter trends will have similar production histories, and as we aim to analyze the main geological distribution of the data, reducing the data dimension is reasonable. We therefore utilize two different DR methods, t-SNE [11] and GPLVM [23,24], to characterize the reservoir parameters efficiently by projecting them onto a 2D plane. t-SNE is a non-linear DR algorithm developed for exploring high-dimensional data. It maps multi-dimensional data to a two- or three-dimensional dataset that can be visualized in a scatter plot. t-SNE learns joint probabilities between pairs of points in the low-dimensional space so that they match, as closely as possible, the conditional probabilities between the corresponding points in the high-dimensional space. For more details about t-SNE, one can refer to [11] and Appendix A. GPLVM differs from t-SNE primarily in that it is a Bayesian non-parametric DR method that uses a Gaussian process to learn a low-dimensional representation of high-dimensional data. The main advantage of the GPLVM is that it allows the use of nonlinear covariance functions, i.e., it can represent non-linear mappings from the latent space to the data space. The probabilistic nature of the GPLVM also gives it advantages in dealing with missing data values. For more details about GPLVM, one can refer to [23,24] and Appendix A.
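As a minimal sketch, the t-SNE projection of an ensemble matrix onto a 2D plane can be obtained with scikit-learn. The ensemble below is synthetic (two artificial "geological trends" standing in for real realizations); a GPLVM embedding could be produced analogously with a library such as GPy.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Synthetic stand-in for the ensemble matrix X: N = 100 realizations, each a
# vector of m = 500 grid-block properties, drawn from two distinct trends.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 500)),
               rng.normal(3.0, 1.0, size=(50, 500))])

# Map the high-dimensional ensemble onto a 2D plane for later cluster analysis.
# Perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(emb.shape)  # (100, 2): one 2D coordinate per realization
```

The resulting coordinates are exactly what the subsequent K-means step operates on.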

K-Means Clustering
K-means clustering is an unsupervised learning method that is widely used because of its efficiency and simplicity. K-means finds the cluster configuration that minimizes the squared error over all K clusters [30]:

E = \sum_{k=1}^{K} \sum_{x \in S_k} \| x - u_k \|^2, with u_k = \frac{1}{|S_k|} \sum_{x \in S_k} x,

where u_k is the centroid of cluster S_k, computed as the average of the data points in the cluster, and |S_k| is the number of samples in the cluster. K-means thus provides a solution by minimizing the distances between the data points and their centroids. Several methods exist for selecting the number of clusters, including the gap statistic, the elbow method, and the silhouette method. In this study, we use the silhouette method to determine the optimal number of clusters. The silhouette index varies between −1 and 1, where a value close to 1 means that a data point fits well within its cluster. For each data point i, the silhouette value s(i) is determined as

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},

where a(i) is the average distance from point i to the other points in its own cluster and b(i) is the minimum average distance from point i to the points of any other cluster. Specifically, a(i) shows how well the i-th point is grouped within its cluster, and b(i) measures the distance to the closest adjacent cluster. A small a(i) means the point is well grouped, and a silhouette value close to 1 implies that b(i) is large relative to a(i).
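The silhouette-based selection of the number of clusters can be sketched as follows. The 2D coordinates are synthetic stand-ins for the t-SNE/GPLVM embedding, and the scanned range of 2 to 7 clusters matches the one used in this study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated synthetic groups standing in for the 2D embedding.
emb = np.vstack([rng.normal((0.0, 0.0), 0.3, size=(50, 2)),
                 rng.normal((4.0, 4.0), 0.3, size=(50, 2))])

# Evaluate the average silhouette for each candidate number of clusters.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(emb)
    scores[k] = silhouette_score(emb, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 2: the silhouette peaks at the true number of groups
```

The same scan applied to the Brugge embeddings is what selects the two clusters reported later.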

ES-MDA and the Localization Technique
In contrast to production forecasting, where the "unknown" reservoir behavior is predicted from the "known" reservoir model variables, history matching inverts the process and estimates the "unknown" reservoir model variables from the "known" observed reservoir behavior. The general objective function for history matching is

J(m) = (g(m) - d_obs)^T C_D^{-1} (g(m) - d_obs),

where g(m) is the simulated data for model variables m composed of reservoir variables (e.g., permeability, facies, porosity, and net-to-gross), d_obs is the observed data, and C_D is the covariance matrix of the observed-data measurement error. The goal of history matching is to find acceptable model variables m that minimize J(m). ES-MDA is an ensemble-based method introduced by Emerick and Reynolds [12]. In its simplest form, the method applies a standard smoother analysis equation a pre-defined number of times, with the covariance matrix of the measured-data error multiplied by a coefficient a_p. The coefficients must be selected such that the following condition is satisfied:

\sum_{p=1}^{N_a} \frac{1}{a_p} = 1,

where N_a is the number of times the analysis step is repeated. The ES-MDA analysis applied to a vector of model parameters m can be written as

m_i^a = m_i^b + K (d_obs - d_{sim,i}),

where i indexes the ensemble members; m_i^a is the updated parameter vector; m_i^b is the initial (or previous) parameter vector; K is the Kalman gain matrix, computed with SVD regularization retaining 99.9% of the total energy; and d_{sim,i} is the simulated data obtained from the previous models. Ensemble-based HM updates the N reservoir models simultaneously. The Kalman gain matrix is determined as

K = C_md (C_dd + a_p C_D)^{-1},

where C_md is the cross-covariance matrix between the model parameters m and the predicted data d; C_dd is the auto-covariance matrix of the predicted data d; and a_p is the coefficient that inflates C_D.
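Under these definitions, a single ES-MDA analysis step can be sketched in NumPy. This is a simplified illustration: a pseudo-inverse stands in for the truncated-SVD regularization, constant coefficients a_p = N_a are used (which automatically satisfy the condition on the sum of 1/a_p), and, as in the original ES-MDA formulation, each member assimilates observations perturbed with the inflated noise a_p C_D.

```python
import numpy as np

def esmda_step(M, D, d_obs, C_D, a_p, rng):
    """One ES-MDA analysis step (no localization).

    M: (Nm, Ne) parameter ensemble; D: (Nd, Ne) simulated-data ensemble;
    d_obs: (Nd,) observations; C_D: (Nd, Nd) measurement-error covariance;
    a_p: inflation coefficient for this assimilation step.
    """
    Ne = M.shape[1]
    dM = M - M.mean(axis=1, keepdims=True)
    dD = D - D.mean(axis=1, keepdims=True)
    C_md = dM @ dD.T / (Ne - 1)   # cross-covariance of parameters and data
    C_dd = dD @ dD.T / (Ne - 1)   # auto-covariance of simulated data
    # Kalman gain; pinv stands in for the paper's SVD truncation at 99.9%.
    K = C_md @ np.linalg.pinv(C_dd + a_p * C_D)
    # Perturb the observations with inflated noise and update each member.
    E = rng.multivariate_normal(np.zeros(d_obs.size), a_p * C_D, size=Ne).T
    return M + K @ (d_obs[:, None] + E - D)

# Constant coefficients a_p = N_a satisfy the condition sum(1/a_p) = 1.
Na = 5
print(abs(sum(1.0 / Na for _ in range(Na)) - 1.0) < 1e-12)  # True
```

Running this step N_a times, re-simulating D between steps, reproduces the multiple-data-assimilation loop.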
There are, however, still conceptual and computational challenges associated with ES-MDA. One issue is ensemble collapse, which may result in unrealistic uncertainty and difficulty in covering the target distribution. To avoid this, a localization technique is implemented by introducing a correlation matrix R via an element-by-element multiplication, also known as the Schur product (∘). There are different ways of computing R; one of the most common is distance-dependent localization [31], in which all data points (oil rate, water rate) and model variables (permeability, porosity) are assumed to have physical locations. The ES-MDA equation is updated to

m_i^a = m_i^b + (R ∘ K)(d_obs - d_{sim,i}).

The entries of R range from 0 to 1 depending on the distance to the well locations [32], computed from a taper that decays smoothly with h, the Euclidean distance between a specific grid cell and the well location, relative to L, the critical length corresponding to the region of influence of each well's data. A high value of R therefore means that a grid block is close to a well.
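One common choice for the distance-dependent entries of R (the exact taper used in [32] may differ) is the fifth-order Gaspari-Cohn function, which decays smoothly from 1 at the well to 0 beyond twice the critical length:

```python
import numpy as np

def gaspari_cohn(h, L):
    """Fifth-order Gaspari-Cohn taper: 1 at h = 0, exactly 0 for h >= 2L.

    h: Euclidean distance(s) between grid cells and a well; L: critical length.
    """
    z = np.atleast_1d(np.abs(h) / L).astype(float)
    r = np.zeros_like(z)
    near = z <= 1.0
    far = (z > 1.0) & (z < 2.0)
    zn, zf = z[near], z[far]
    r[near] = (-0.25 * zn**5 + 0.5 * zn**4 + 0.625 * zn**3
               - 5.0 / 3.0 * zn**2 + 1.0)
    r[far] = (zf**5 / 12.0 - 0.5 * zf**4 + 0.625 * zf**3
              + 5.0 / 3.0 * zf**2 - 5.0 * zf + 4.0 - 2.0 / (3.0 * zf))
    return r

# Taper weight shrinks with distance from the well (critical length L = 10).
print(gaspari_cohn(np.array([0.0, 5.0, 10.0, 25.0]), 10.0))
```

Evaluating this taper for every (model variable, data point) pair assembles R, which is then applied to the Kalman gain via the Schur product, e.g. `K_loc = R * K` with NumPy's element-wise multiplication.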

General Setup
We tested the performance of the proposed methodology in the Brugge field case study. The Brugge field is a complex oilfield constructed by TNO [33]. The model consists of nine layers, and each layer has 139 × 48 gridblocks. The total number of gridblocks is 60,048, with 44,550 active cells. There are 20 producers and 10 injectors in the reservoir models. The reservoir is being depleted by voidage replacement. The producers and injectors are "smart wells", i.e., with vertical flow control, with three perforation intervals per well. For each producer well, the fluid rate is set to a maximum of 3000 bbl/day, with the flowing bottom-hole pressure kept above 50 bar. For each injector well, the fluid rate is set to a maximum of 4000 bbl/day, with the flowing bottom-hole pressure kept below 180 bar. We used the 104 initial geostatistical realizations provided by TNO and assumed one of them to be the reference model. In the HM analysis, oil production rates (OPR), water cuts (WCT), and bottom-hole pressures (BHP) were considered, and the model variables to be updated included permeability (PERMX, PERMY, and PERMZ), porosity, and NTG in all active cells. Figure 2 shows the log permeability in the first layer for six random realizations. For more information about the Brugge benchmark, see [33].

ES-MDA with DR
The data assimilations were conducted using ES-MDA with localization, with N_a = 5. For the standard ES-MDA as well as ES-MDA-t-SNE and ES-MDA-GPLVM, we assimilated the observation data of the first 4 years and used the remaining 6 years for forecasting.
To assess the quality of the prior models, the field oil production rate of the 103 prior models was compared with d_obs by applying RMD outlier detection. The RMD of d_obs was found to be 8.802, which is below the 95th-percentile threshold; the prior is therefore not falsified. Figure 3 compares the RMD of d_obs with the RMDs of the 103 prior models. We then applied t-SNE and GPLVM to the reservoir models and reduced the dimension to a 2D space. The silhouette method was used to find the optimal number of clusters (candidates ranging from 2 to 7) for both the GPLVM and t-SNE 2D spaces. As displayed in Figure 4, where the dashed line denotes the average silhouette, the silhouette plot with 2 clusters showed the highest value; the models were therefore divided into 2 clusters. Figures 5 and 6 show scatter plots of the 103 models on the 2D plane, with each dot indicating an individual model. When selecting the cluster with the least production error and comparing the forecast accuracy of different forecasting methods across data sets, there are many performance measures from which to choose. In this study, we evaluated our forecasting results with a probabilistic metric, the mean continuous ranked probability score (CRPS). The mean CRPS quantifies both accuracy and precision [34], with higher values indicating less accurate results; its mathematical formulation is listed in Appendix A. We compared the field oil production rate (FOPR) errors between each cluster and the reference model and selected the cluster with the least production error for the data assimilation process, as displayed in Table 1. Only 46 models were selected using t-SNE and 44 models using GPLVM. In Figure 7, we compare the average permeability values of the initial 103 models with those of the 46 and 44 models selected using t-SNE and GPLVM, respectively.
The models selected with t-SNE and GPLVM have quite similar distributions, and the selected sets themselves are quite similar, differing in only two models. The total simulation time for each method is listed in Table 2. ES-MDA uses around 220 min for the entire process, while ES-MDA-t-SNE and ES-MDA-GPLVM use around 120 and 101.5 min, respectively; by employing the reduction techniques, more than 45% of the total simulation time was saved. For a quantitative comparison, we applied the mean CRPS metric to all simulated well data from the history-matched ensembles over the historical and prediction periods, as displayed in Figure 12. ES-MDA-t-SNE and ES-MDA-GPLVM provide interesting results, with the lowest average CRPS compared to the prior model; although we used fewer ensemble members and saved around 45-53% of the simulation time, the results are comparable to those of the standard ES-MDA. Since the uncertainty ranges are quantified in all wells for the three cases, we conducted another comparison using boxplots of the normalized field cumulative oil and water production predicted by each method, as displayed in Figure 13. The values are normalized to the cumulative production of the reference case, which is also added for comparison. Both ES-MDA-t-SNE and ES-MDA-GPLVM cover the true values for both oil and water production within the box ranges.

Effect of Different "Reference" Models
The previous section evaluated the ES-MDA-DR procedure on a single 'reference' model. We now evaluate the methodology on five additional 'reference' models: Test Cases 1 through 5 (we reiterate that none of these reference models was included in the set of N = 103 prior models). As before, the data assimilations were conducted using ES-MDA with localization, and for all three methods we assimilated the observation data of the first 4 years and used the remaining 6 years for forecasting. Note that the reference value varies considerably between the test cases, as shown in Figure 14: the cumulative distribution function (CDF) for each test case is quite different, except for Test Cases 1 and 3, which show some similarities. Table 3 reports the posterior normalized field cumulative oil and water production predicted by each method in terms of the P10 and P90 statistics at 10 years. In four cases, the ES-MDA, ES-MDA-t-SNE, and ES-MDA-GPLVM predictions surround the reference data within the P10-P90 range and show a narrower range of prediction uncertainty. In Test Case 2, however, the cumulative oil production is biased with respect to the reference production, which may be explained either by problems with the prior ensemble or by the selected reference model. Figure 15 shows the posterior ensemble-mean distributions of the reconstructed updated permeability in the top layer for all five cases. The results suggest a slight change in the posterior model where the wells are located. Additionally, one can expect uncertainty reductions due to the conditioning of the posterior samples to the dynamic data of wells contained within the forecast domain. ES-MDA-t-SNE and ES-MDA-GPLVM exhibit quite similar posterior permeability distributions.
In sum, the results for the five different test cases imply that the DR procedure can indeed provide updated geological models and predictions for different reference models at a significantly reduced computation time.

Figure 14. Empirical CDF computed from the field oil production total for the different test cases.

Effect of Reference Model Parameters Outside Prior Distribution
In the previous section, we presented the results of ES-MDA, ES-MDA-t-SNE, and ES-MDA-GPLVM for tests in which every 'reference' model was within the prior distributions, i.e., the prior realizations used in the three methods were consistent with the underlying 'reference' model. Here, we evaluate the performance of the three methods for a case involving a reference model that is not consistent with the prior realizations; more precisely, the reference model is characterized by parameters that fall outside the prior ranges. Figure 16 displays the permeability cumulative distribution function (CDF) of the generated reference model along with the P10 and P90 of the prior ensemble. We observe that the 'reference' model permeability parameters for this example lie outside the range of the prior distributions. RMD outlier detection was applied, as displayed in Figure 17, to check the prior uncertainty variables (field oil production) against the reference variable. The RMD of d_obs falls above the 95th-percentile threshold, so the prior model is falsified. Figures 18 and 19 depict the HM profiles for both oil rates and water cuts of the three methods at producers BR-P5, BR-P6, and BR-P19, with respect to the prior ensemble and the reference model. The results indicate that both the standard ES-MDA and ES-MDA with DR failed to match the reference data, and the predictions do not contain the reference model within the P10-P90 percentile range. Figure 20 displays boxplots of the field cumulative water and oil production obtained by each method and the prior ensemble. Overall, the results do not cover the reference values, which is clearly explained by the lack of representativeness of the prior realizations. These results demonstrate that the success of our procedure depends on the quality of the prior parameter ranges.
We emphasize that it is crucial that the prior simulation results contain the observations. Otherwise, we would not expect ES-MDA with DR to provide reasonable posterior predictions, and in practice, we should adapt the prior ensemble before using it for model conditioning.

Concluding Remarks
In this study, we presented a novel history matching framework with DR that preserves realistic geology while matching the production data, achieved by explicitly integrating t-SNE, GPLVM, and K-means clustering with ES-MDA to reduce the simulation time and quantify the uncertainty of the reservoir models. The proposed procedure yielded reliable results by selecting a set of good prior ensemble reservoir models, with production performance similar to the reference model, before applying the data assimilation process. We compared the new implementation with the standard ES-MDA in a field reservoir problem with a large number of wells and a long production history. Based on the obtained results, the proposed ES-MDA with DR is computationally faster than the original one, and it is very simple to implement and to integrate with different types of data and models. We also evaluated our procedure with five different 'reference' models, where we observed that the ES-MDA with DR posterior predictions displayed considerably less uncertainty and were indeed able to provide improved geological models and predictions at a significantly reduced computation time. Moreover, we considered a test case where the reference model lay outside the prior distributions; here the results were clearly inconsistent and biased. In conclusion, the accuracy of both methods relies heavily on the ability of the prior realizations to provide appropriate estimates of the prior uncertainty.
We recommend that further studies apply our procedure to more complex geological models such as bimodal channelized systems. The approach can also be applied to examine and overcome the challenges of 4D seismic history matching, as capturing the value of 4D seismic data can lead to better reservoir management decisions. It would also be interesting to introduce into the framework other non-linear DR techniques, such as deep autoencoders, stacked autoencoders, and generative adversarial networks. Additionally, combining the data-space inversion (DSI) method with ES-MDA may predict oil production more accurately with computationally faster simulation.

Appendix A

The t-SNE objective function f is minimized using gradient descent with momentum,

Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial f}{\partial Y} + \alpha(t) \left( Y^{(t-1)} - Y^{(t-2)} \right),

where \eta is the learning rate and \alpha(t) the momentum at iteration t. For N samples, the CRPS can be evaluated as

CRPS = \int_{-\infty}^{+\infty} \left[ P(x) - H(x - x_{obs}) \right]^2 dx,

where H is the Heaviside step function centered at the observed value x_obs and P is the empirical cumulative distribution function, taken piecewise constant with p_i = P(x) = i/N for x_i < x < x_{i+1}.
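The empirical CRPS of an ensemble forecast can be sketched via the equivalent energy form CRPS = E|X − x_obs| − ½ E|X − X′|, which coincides with the integral of the squared difference between the piecewise-constant empirical CDF and the step function at the observation; the function name is illustrative.

```python
import numpy as np

def crps_ensemble(samples, obs):
    """Empirical CRPS of an ensemble forecast against a scalar observation."""
    x = np.asarray(samples, dtype=float)
    term_acc = np.mean(np.abs(x - obs))                           # accuracy
    term_spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # precision
    return term_acc - term_spread

# A forecast concentrated away from the observation scores its absolute error.
print(crps_ensemble([3.0, 3.0, 3.0], 5.0))  # 2.0
# A spread ensemble bracketing the observation scores lower (better).
print(crps_ensemble([4.0, 6.0], 5.0))  # 0.5
```

Averaging this score over all wells and report times gives the mean CRPS used to compare the methods in this study.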